• December 22, 2024

Hackernoon Web Scraping

Building a Web Scraper from start to finish | Hacker Noon

Building a Web Scraper from start to finish | Hacker Noon

The basic idea of web scraping is that we are taking existing HTML data, using a web scraper to identify the data, and convert it into a useful format. The prerequisites we’ll talk about next are: Python Basics Python Libraries Storing data as JSON (JavaScript Object Notation) The end stage is to have this data stored as either. or in another useful format, such as. HTML structures, and data formats (JCPN) and other useful is a Web Scraper? A Web Scraper is a program that quite literally scrapes or gathers data off of websites. Take the below hypothetical example, where we might build a web scraper that would go to twitter, and gather the content of the above example, we might use a web scraper to gather data from Twitter. We might limit the gathered data to tweets about a specific topic, or by a specific author. As you might imagine, the data that we gather from a web scraper would largely be decided by the parameters we give the program when we build it. At the bare minimum, each web scraping project would need to have a URL to scrape from. In this case, the URL would be Secondly, a web scraper would need to know which tags to look for the information we want to scrape. In the above example, we can see that we might have a lot of information we wouldn’t want to scrape, such as the header, the logo, navigation links, etc. Most of the actual tweets would probably be in a paragraph tag, or have a specific class or other identifying feature. Knowing how to identify where the information on the page is takes a little research before we build the this point, we could build a scraper that would collect all the tweets on a page. This might be useful. Or, we could further filter the scrape, but specifying that we only want to scrape the tweets if it contains certain content. Still looking at our first example, we might be interested in only collecting tweets that mention a certain word or topic, like “Governor. ” It might be easier to collect a larger group of tweets and parse them later on the back end. Or, we could filter some of the results here are web scrapers useful? We’ve partially answered this question in the first section. Web scraping could be as simple as identifying content from a large page, or multiple pages of information. However, one of the great things about scraping the web, is that it gives us the ability to not only identify useful and relevant information, but allows us to store that information for later use. In the above example, we might want to store the data we’ve collected from tweets so that we could see when tweets were the most frequent, what the most common topics were, or what individuals were mentioned the most prerequisites do we need to build a web scraper? Before we get into the nuts and bolts of how a web scraper works, let’s take a step backward, and talk about where web-scraping fits into the broader ecosystem of web technologies. Take a look at the simple workflow below:The basic idea of web scraping is that we are taking existing HTML data, using a web scraper to identify the data, and convert it into a useful format. The end stage is to have this data stored as either JSON, or in another useful format. As you can see from the diagram, we could use any technology we’d prefer to build the actual web scraper, such as Python, PHP or even Node, just to name a few. For this example, we’ll focus on using Python, and it’s accompanying library, Beautiful Soup. It’s also important to note here, that in order to build a successful web scraper, we’ll need to be at least somewhat familiar with HTML structures, and data formats like make sure that we’re all on the same page, we’ll cover each of these prerequisites in some detail, since it’s important to understand how each technology fits into a web scraping project. The prerequisites we’ll talk about next are:HTML structuresPython BasicsPython LibrariesStoring data as JSON (JavaScript Object Notation)If you’re already familiar with any of these, feel free to skip ahead. 1. HTML Structures1. a. Identifying HTML TagsIf you’re unfamiliar with the structure of HTML, a great place to start is by opening up Chrome developer tools. Other browsers like Firefox and Internet explorer also have developer tools, but for this example, I’ll be using Chrome. If you click on the three vertical dots in the upper right corner of the browser, and then ‘More Tools’ option, and then ‘Developer Tools’, you will see a panel that pops up which looks like the following:We can quickly see how the current HTML site is structured. All of the content as contained in specific ‘tags’. The current heading is in an “

” tag, while most of the paragraphs are in “

” tags. Each of the tags also have other attributes like “class” or “name”. We don’t need to know how to build an HTML site from scratch. In building a web scraper, we only need to know the basic structure of the web, and how to identify specific web elements. Chrome and other browser developer tools allow us to see what tags contain the information we want to scrape, as well as other attributes like “class”, that might help us select only specific ’s look at what a typical HTML structure might look like:This is similar to what we just looked at in the chrome dev tools. Here, we can see that all the elements in the HTML are contained within the opening and closing ‘body’ tags. Every element has also has it’s own opening and closing tag. Elements that are nested or indented in an HTML structure indicate that the element is a child element of it’s container, or parent element. Once we start making our Python web scraper, we can also identify elements that we want to scrape based not only on the tag name, but whether it the element is a child of another element. For example, we can see here that there is a