Web Crawler Script
How to Build a Basic Web Crawler to Pull Information From a …
Programs that read information from websites, or web crawlers, have all kinds of useful applications. You can scrape stock information, sports scores, or text from a Twitter account, or pull prices from shopping websites.
Writing these web crawling programs is easier than you might think. Python has a great library for writing scripts that extract information from websites. Let’s look at how to create a web crawler using Scrapy.
Installing Scrapy
Scrapy is a Python library that was created to scrape the web and build web crawlers. It is fast, simple, and can navigate through multiple web pages without much effort.
Scrapy is available through the Pip Installs Python (PIP) package installer. If you need it, find a refresher on how to install PIP on Windows, Mac, and Linux before continuing.
Using a Python Virtual Environment is preferred because it will allow you to install Scrapy in a virtual directory that leaves your system files alone. Scrapy’s documentation recommends doing this to get the best results.
Create a directory and initialize a virtual environment.
mkdir crawler
cd crawler
virtualenv venv
. venv/bin/activate
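If you want to confirm that the environment is active before installing anything, a small check from Python can help. This is a minimal sketch using only the standard library; the function name in_virtualenv is made up for illustration, and it assumes the environment was created with virtualenv or venv.

```python
import sys

def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix

if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Run it with the python command from the activated shell; if it reports False, the activate step probably did not take effect in that shell.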
You can now install Scrapy into that directory using a PIP command.
pip install scrapy
A quick check to make sure Scrapy is installed properly:
scrapy
# prints
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]
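The same check can be made from inside Python, which is handy when a script should fail early with a clear message. The sketch below uses only the standard library; the helper name is_installed is hypothetical, and it only tests whether a package can be found, not which version is present.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None

if __name__ == "__main__":
    # "scrapy" is the import name of the package installed with PIP above.
    print("scrapy installed:", is_installed("scrapy"))
```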
How to Build a Web Crawler
Now that the environment is ready, you can start building the web crawler. Let’s scrape some information from a Wikipedia page on batteries: https://en.wikipedia.org/wiki/Battery_(electricity).
The first step in writing a crawler is defining a Python class that extends scrapy.Spider. This gives you access to all the functions and features in Scrapy. Let’s call this class spider1.
A spider class needs a few pieces of information:
a name for identifying the spider
a start_urls variable containing a list of URLs to crawl from (the Wikipedia URL will be the example in this tutorial)
a parse() method which is used to process the webpage to extract information
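import scrapy

class spider1(scrapy.Spider):
    # Name used to identify the spider
    name = 'Wikipedia'
    # Pages the crawl starts from
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        # Extraction logic is added in the following sections
        pass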
A quick test to make sure everything is running properly. Pass the file that holds the spider class to scrapy runspider.
scrapy runspider <spider_file>.py
# prints
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-11-23 09:09:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
…
Turning Off Logging
Running Scrapy with this class prints log information that won’t help you right now. Let’s make it simple by removing this excess log information. Raise the log level of the scrapy logger to WARNING by adding this code to the beginning of the file.
import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)
Now when you run the script again, the log information will not print.
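Setting the level on one logger is enough because Python loggers form a hierarchy: a logger named scrapy.middleware inherits its level from scrapy unless it sets its own. The sketch below shows that inheritance using only the standard logging module; no Scrapy code runs here, and the child logger name is taken from the sample output above.

```python
import logging

# Setting the level on the parent "scrapy" logger also applies to child
# loggers such as "scrapy.middleware", unless a child sets its own level.
logging.getLogger("scrapy").setLevel(logging.WARNING)

child = logging.getLogger("scrapy.middleware")
info_enabled = child.isEnabledFor(logging.INFO)        # INFO is below WARNING
warning_enabled = child.isEnabledFor(logging.WARNING)  # WARNING still passes
print(info_enabled, warning_enabled)
```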
Using the Chrome Inspector
Everything on a web page is stored in HTML elements. The elements are arranged in the Document Object Model (DOM). Understanding the DOM is critical to getting the most out of your web crawler. A web crawler searches through all of the HTML elements on a page to find information, so knowing how they’re arranged is important.
Google Chrome has tools that help you find HTML elements faster. You can locate the HTML for any element you see on the web page using the inspector.
Navigate to a page in Chrome
Place the mouse on the element you would like to view
Right-click and select Inspect from the menu
These steps will open the developer console with the Elements tab selected. At the bottom of the console, you will see a tree of elements. This tree is how you will get information for your script.
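To get a feel for the element tree without opening a browser, you can print an outline of a page's elements from Python. The following is a rough sketch built on the standard library's html.parser rather than on Scrapy; the TreePrinter class and the sample HTML are made up for illustration, and the sample is only a simplified stand-in for the real Wikipedia markup.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collect an indented outline of the elements in an HTML string."""
    # Void elements never get a closing tag, so they must not add depth.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        label = tag + ("#" + element_id if element_id else "")
        self.lines.append("  " * self.depth + label)
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1

# Hypothetical, heavily simplified stand-in for the Wikipedia page.
SAMPLE_HTML = (
    '<html><body><h1 id="firstHeading">Battery (electricity)</h1>'
    '<div id="mw-content-text"><div><p>An electric battery...</p></div></div>'
    '</body></html>'
)

printer = TreePrinter()
printer.feed(SAMPLE_HTML)
outline = "\n".join(printer.lines)
print(outline)
```

Each level of indentation in the outline corresponds to one step down the tree, which is the same nesting the inspector shows.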
Let’s get the script to do some work for us: a simple crawl to get the title text of the web page.
Start the script by adding some code to the parse() method that extracts the title.
...
    def parse(self, response):
        print(response.css('h1#firstHeading::text').extract())
...
The response argument supports a method called css() that selects elements from the page using the CSS selector you provide.
In this example, the element is h1#firstHeading. Adding ::text to the selector is what gives you the text content of the element. Finally, the extract() method returns the selected text as a list.
Running this script in Scrapy prints the title as a one-item list. The u prefix marks a unicode string under Python 2; Python 3 omits it.
[u’Battery (electricity)’]
Finding the Description
Now that we’ve scraped the title text, let’s do more with the script. The crawler is going to find the first paragraph after the title and extract this information.
Here’s the element tree in the Chrome Developer Console:
div#mw-content-text>div>p
The right arrow (>) indicates a parent-child relationship between the elements.
This location will return all of the p elements matched, which includes the entire description. To get the first p element you can write this code:
response.css('div#mw-content-text>div>p')[0]
Just like the title, you add the CSS extractor ::text to get the text content of the element.
response.css('div#mw-content-text>div>p')[0].css('::text')
The final expression uses extract() to return the list. You can use the Python join() function to join the list once all the crawling is complete.
    def parse(self, response):
        # Join the text fragments of the first paragraph into one string
        print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').extract()))
The result is the first paragraph of the text!
An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars. [1] When a battery is supplying electric power, its positive terminal is…
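The join() step matters because a paragraph's text is usually split across nested elements such as links and bold text, so the extracted list holds several fragments rather than one string. The sketch below imitates that behavior with the standard library's html.parser instead of Scrapy's selectors; the FirstParagraphText class and the sample HTML are hypothetical and exist only to show the fragments before and after joining.

```python
from html.parser import HTMLParser

class FirstParagraphText(HTMLParser):
    """Collect the text fragments found inside the first <p> element."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while parsing the first <p>
        self.done = False     # True once the first <p> has closed
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.fragments.append(data)

# Hypothetical snippet shaped like the start of the article body.
SAMPLE_HTML = (
    "<div><p>An electric <b>battery</b> is a device with "
    '<a href="#">electrochemical cells</a>.</p><p>Second paragraph.</p></div>'
)

parser = FirstParagraphText()
parser.feed(SAMPLE_HTML)
parser.close()
paragraph = "".join(parser.fragments)
print(parser.fragments)
print(paragraph)
```

Joining with an empty string keeps the original spacing, since the spaces already live inside the fragments.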
Collecting JSON Data
Scrapy can extract information in text form, which is useful. Scrapy also lets you export the data as JavaScript Object Notation (JSON). JSON is a neat way to organize information and is widely used in web development. JSON works pretty nicely with Python as well.
When you need to collect data as JSON, you can use Python’s yield statement inside parse(). Scrapy collects each yielded item for you.
Here’s a new version of the script using a yield statement. Instead of getting the first p element in text format, this will grab all of the p elements and organize them in JSON format.
...
    def parse(self, response):
        for e in response.css('div#mw-content-text>div>p'):
            # One item per paragraph, with surrounding whitespace removed
            yield { 'para': ''.join(e.css('::text').extract()).strip() }
...
You can now run the spider by specifying an output JSON file:
scrapy runspider <spider_file>.py -o <output_file>.json
The output file will now contain all of the p elements.
[{"para": "An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars. [1] When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode. [2] The terminal marked negative is the source of electrons that when connected to an external circuit will flow and deliver energy to an external device. When a battery is connected to an external circuit, electrolytes are able to move as ions within, allowing the chemical reactions to be completed at the separate terminals and so deliver energy to the external circuit. It is the movement of those ions within the battery which allows current to flow out of the battery to perform work. [3] Historically the term \"battery\" specifically referred to a device composed of multiple cells, however the usage has evolved additionally to include devices composed of a single cell. [4]"},
{"para": "Primary (single-use or \"disposable\") batteries are used once and discarded; the electrode materials are irreversibly changed during discharge. Common examples are the alkaline battery used for flashlights and a multitude of portable electronic devices. Secondary (rechargeable) batteries can be discharged and recharged multiple…
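To see how yielded dictionaries turn into a JSON array like the one above, you can imitate the idea with plain Python. This is only a sketch under stated assumptions: the parse() function here is a stand-in that takes a list of strings rather than a Scrapy response, the sample texts are invented, and the standard json module plays the role that Scrapy's exporter plays when you pass -o.

```python
import json

def parse(paragraphs):
    """Yield one dictionary per paragraph, like the spider's parse() method."""
    for text in paragraphs:
        yield {"para": text.strip()}

# Hypothetical paragraph texts standing in for the scraped p elements.
sample = ["  An electric battery is a device.  ", "Primary batteries are used once.\n"]

items = list(parse(sample))
print(json.dumps(items, indent=2))
```

Because each item is an ordinary dictionary, the resulting file can be read back later with json.load() and processed like any other Python list.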