Scrapy: Python Web Scraping & Crawling for Beginners
When I was a kid, I saw a YouTube video on how to make a folder invisible on Windows, and I have never looked back since. My love for technology has only grown. I started with security, since that was one of the areas that fascinated me, and went on to win the award for Photoshop design at Cofas 2012. On the destructive side, I wrote scripts that messed up the systems at my school and was almost suspended. I learned my lesson and vowed to do only constructive things. To make people aware of security issues, two friends and I started a Facebook page and group called YAP. To make things more interesting, web development came into my life, which helped me get into the most prestigious chapter at my college, IEEE. This helped me explore the things that caught my attention: Android development, augmented reality, machine learning, Python development, and the Internet of Things (IoT). I never really wanted to go to college, and I am still against it, but it made me realize that the joy of creating something with a team of people is unparalleled. I created a blog that helps people who are not so familiar with technology get familiar with it and benefit from it; it has more than 500,000 views. Along the way I learned, and am still learning, writing, the WordPress CMS, SEO, Google Analytics and AdSense, and how to market a product after creating it.
Crawling and Scraping Web Pages with Scrapy and … – DigitalOcean
Introduction
Web scraping, often called web crawling or web spidering, is the practice of programmatically going over a collection of web pages and extracting data, and it is a powerful tool for working with data on the web.
With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own personal curiosity.
In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use BrickSet, a community-run site that contains information about LEGO sets. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages on Brickset and extracts data about LEGO sets from each page, displaying the data to your screen.
The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.
Prerequisites
To complete this tutorial, you’ll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.
Step 1 — Creating a Basic Scraper
Scraping is a two step process:
You systematically find and download web pages.
You take those web pages and extract information from them.
Both of those steps can be implemented in a number of ways in many languages.
You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.
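To make that trade-off concrete, here is a minimal sketch of the from-scratch approach using only Python's standard library; the example URL and the title-only extraction are illustrative, and a real scraper would still have to add error handling, politeness delays, and concurrency on its own:
import re
import urllib.request

def scrape_title(url):
    # Step 1: download the web page.
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Step 2: extract information from it (here, just the <title>).
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

print(scrape_title("https://example.com/"))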
You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.
Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time. It makes scraping a quick and fun process!
Scrapy, like most Python packages, is on PyPI. PyPI, the Python Package Index, is a community-owned repository of all published Python software, and pip is the tool you use to install packages from it.
If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command:
pip install scrapy
If you run into any issues with the installation, or you want to install Scrapy without using pip, check out the official installation docs.
With Scrapy installed, let’s create a new folder for our project. You can do this in the terminal by running:
mkdir brickset-scraper
Now, navigate into the new directory you just created:
cd brickset-scraper
Then create a new Python file for our scraper called scraper.py. We’ll place all of our code in this file for this tutorial. You can create this file in the terminal with the touch command, like this:
touch scraper.py
Or you can create the file using your text editor or graphical file manager.
We’ll start by making a very basic scraper that uses Scrapy as its foundation. To do that, we’ll create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes:
name — just a name for the spider.
start_urls — a list of URLs that you start to crawl from. We’ll start with one URL.
Open the file in your text editor and add this code to create the basic spider:
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ["https://brickset.com/sets/year-2016"]
Let’s break this down line by line:
First, we import scrapy so that we can use the classes that the package provides.
Next, we take the Spider class provided by Scrapy and make a subclass out of it called BrickSetSpider. Think of a subclass as a more specialized form of its parent class. The Spider subclass has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn’t know where to look or what data to look for. By subclassing it, we can give it that information.
Then we give the spider the name brickset_spider.
Finally, we give our scraper a single URL to start from: https://brickset.com/sets/year-2016. If you open that URL in your browser, it will take you to a search results page, showing the first of many pages containing LEGO sets.
Now let’s test out the scraper. You typically run Python files with a command like python path/to/scraper.py. However, Scrapy comes with its own command-line interface to streamline the process of starting a scraper. Start your scraper with the following command:
scrapy runspider scraper.py
You’ll see something like this:
Output
2016-09-22 23:37:45 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
2016-09-22 23:37:45 [scrapy] INFO: Overridden settings: {}
2016-09-22 23:37:45 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 …,
 'scrapy.extensions.corestats.CoreStats']
2016-09-22 23:37:45 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 …]
2016-09-22 23:37:45 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 …]
2016-09-22 23:37:45 [scrapy] INFO: Enabled item pipelines:
[]
2016-09-22 23:37:45 [scrapy] INFO: Spider opened
2016-09-22 23:37:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-22 23:37:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-22 23:37:47 [scrapy] DEBUG: Crawled (200)
2016-09-22 23:37:47 [scrapy] INFO: Closing spider (finished)
2016-09-22 23:37:47 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 224,
 'downloader/request_count': 1,
 …
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 9, 23, 6, 37, 45, 995167)}
2016-09-22 23:37:47 [scrapy] INFO: Spider closed (finished)
That’s a lot of output, so let’s break it down.
The scraper initialized and loaded additional components and extensions it needed to handle reading data from URLs.
It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would do.
It passed that HTML to the parse method, which doesn’t do anything by default. Since we never wrote our own parse method, the spider just finishes without doing any work.
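To see where that work would go, here is a minimal sketch of overriding parse; it only prints the page title, and the title::text selector is just an illustration rather than part of the tutorial's scraper:
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ["https://brickset.com/sets/year-2016"]

    def parse(self, response):
        # parse() receives the downloaded page; extraction code goes here.
        print("Fetched:", response.url)
        print("Page title:", response.css("title::text").extract_first())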
Now let’s pull some data from the page.
Step 2 — Extracting Data from a Page
We’ve created a very basic program that pulls down a page, but it doesn’t do any scraping or spidering yet. Let’s give it some data to extract.
If you look at the page we want to scrape, you’ll see it has the following structure:
There’s a header that’s present on every page.
There’s some top-level search data, including the number of matches, what we’re searching for, and the breadcrumbs for the site.
Then there are the sets themselves, displayed in what looks like a table or ordered list. Each set has a similar format.
When writing a scraper, it’s a good idea to look at the source of the HTML file and familiarize yourself with the structure. So here it is, with some things removed for readability:
…
Scraping this page is a two step process:
First, grab each LEGO set by looking for the parts of the page that have the data we want.
Then, for each set, grab the data we want from it by pulling the data out of the HTML tags.
Scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within them. Scrapy supports either CSS selectors or XPath selectors.
We’ll use CSS selectors for now, since CSS is the easier option and a perfect fit for finding all the sets on the page. If you look at the HTML for the page, you’ll see that each set is specified with the class set. Since we’re looking for a class, we’d use .set for our CSS selector. All we have to do is pass that selector to the response object’s css method, like this:
def parse(self, response):
    SET_SELECTOR = '.set'
    for brickset in response.css(SET_SELECTOR):
        pass
This code grabs all the sets on the page and loops over them to extract the data. Now let’s extract the data from those sets so we can display it.
Another look at the source of the page we’re parsing tells us that the name of each set is stored within an h1 tag for each set:
Brick Bank
10251-1
The brickset object we’re looping over has its own css method, so we can pass in a selector to locate child elements. Modify your code as follows to locate the name of the set and display it:
        NAME_SELECTOR = 'h1::text'
        yield {
            'name': brickset.css(NAME_SELECTOR).extract_first(),
        }
Note: The trailing comma after extract_first() isn’t a typo. We’re going to add more to this section soon, so we’ve left the comma there to make adding to this section easier later.
You’ll notice two things going on in this code:
We append ::text to our selector for the name. That’s a CSS pseudo-selector that fetches the text inside the a tag rather than the tag itself.
We call extract_first() on the object returned by brickset.css(NAME_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.
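As a quick, self-contained illustration of that difference (the HTML snippet here is made up), you can build a Scrapy Selector from a string and compare the two calls:
from scrapy.selector import Selector

sel = Selector(text="<h1><a>Brick Bank</a></h1><h1><a>Big Ben</a></h1>")
print(sel.css("h1 a::text").extract())        # ['Brick Bank', 'Big Ben'] -- every match, as a list
print(sel.css("h1 a::text").extract_first())  # 'Brick Bank' -- just the first match, as a string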
Save the file and run the scraper again with scrapy runspider scraper.py.
This time you’ll see the names of the sets appear in the output:
Output…
[scrapy] DEBUG: Scraped from <200 >
{'name': 'Brick Bank'}
{'name': 'Volkswagen Beetle'}
{'name': 'Big Ben'}
{'name': 'Winter Holiday Train'}…
Let’s keep expanding on this by adding new selectors for images, pieces, and miniature figures, or minifigs that come with a set.
Take another look at the HTML for a specific set:
…
10251: Brick Bank
…
We can see a few things by examining this code:
The image for the set is stored in the src attribute of an img tag inside an a tag at the start of the set. We can use another CSS selector to fetch this value just like we did when we grabbed the name of each set.
Getting the number of pieces is a little trickier. There’s a dt tag that contains the text Pieces, and then a dd tag that follows it which contains the actual number of pieces. We’ll use XPath, a query language for traversing XML, to grab this, because it’s too complex to be represented using CSS selectors.
Getting the number of minifigs in a set is similar to getting the number of pieces. There’s a dt tag that contains the text Minifigs, followed by a dd tag right after that with the number.
So, let’s modify the scraper to get this new information:
    name = 'brick_spider'

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }
Save your changes and run the scraper again with scrapy runspider scraper.py.
Now you’ll see that new data in the program’s output:
Output
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 >
{'minifigs': '5', 'pieces': '2380', 'name': 'Brick Bank', 'image': ''}
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 >
{'minifigs': None, 'pieces': '1167', 'name': 'Volkswagen Beetle', 'image': ''}
{'minifigs': None, 'pieces': '4163', 'name': 'Big Ben', 'image': ''}
{'minifigs': None, 'pieces': None, 'name': 'Winter Holiday Train', 'image': ''}
{'minifigs': None, 'pieces': None, 'name': 'XL Creative Brick Box', 'image': '/assets/images/misc/'}
{'minifigs': None, 'pieces': '583', 'name': 'Creative Building Set', 'image': ''}
Now let’s turn this scraper into a spider that follows links.
Step 3 — Crawling Multiple Pages
We’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.
You’ll notice that the top and bottom of each page has a little right caret (>) that links to the next page of results. Here’s the HTML for that:
…
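Here is a minimal sketch of how the spider could follow that link; the .next a::attr(href) selector and the start URL are assumptions about Brickset's pagination markup, so verify them against the actual HTML before relying on them:
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ["https://brickset.com/sets/year-2016"]
    # Assumed selector for the "next page" link; check it against the real markup.
    NEXT_PAGE_SELECTOR = ".next a::attr(href)"

    def parse(self, response):
        # ... yield the per-set dictionaries exactly as before ...
        next_page = response.css(self.NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            # Build an absolute URL and queue the next page, re-using parse as the callback.
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse,
            )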
Making Web Crawlers Using Scrapy for Python – DataCamp
If you would like an overview of web scraping in Python, take DataCamp’s Web Scraping with Python course.
In this tutorial, you will learn how to use Scrapy, a Python framework with which you can handle large amounts of data! You will learn Scrapy by building a web scraper for aliexpress.com, an e-commerce website. Let’s get scraping!
Scrapy Overview
Scrapy Vs. BeautifulSoup
Scrapy Installation
Scrapy Shell
Creating a project and Creating a custom spider
Basic HTML and CSS knowledge will help you understand this tutorial with greater ease and speed. Read this article for a refresher on HTML and CSS.
Web scraping has become an effective way of extracting information from the web for decision making and analysis. It has become an essential part of the data science toolkit. Data scientists should know how to gather data from web pages and store that data in different formats for further analysis.
Any web page you see on the internet can be crawled for information, and anything visible on a web page can be extracted [2]. Every web page has its own structure and web elements, which is why you need to write your web crawlers/spiders according to the page being extracted.
Scrapy provides a powerful framework for extracting data, processing it, and then saving it.
Scrapy uses spiders, which are self-contained crawlers that are given a set of instructions [1]. Scrapy makes it easier to build and scale large crawling projects by allowing developers to reuse their code.
In this section, you will get an overview of one of the most popular web scraping tools, BeautifulSoup, and how it compares to Scrapy.
Scrapy is a Python framework for web scraping that provides a complete package for developers, without their having to worry about maintaining boilerplate code.
Beautiful Soup is also widely used for web scraping. It is a Python package for parsing HTML and XML documents and extracting data from them. It is available for Python 2.6+ and Python 3.
Here are some differences between them in a nutshell:
Functionality
Scrapy — the complete package for downloading web pages, processing them, and saving the results to files and databases.
BeautifulSoup — basically an HTML and XML parser; it requires additional libraries such as requests or urllib2 to open URLs and store the results [6].
Learning Curve
Scrapy — a powerhouse for web scraping that offers a lot of ways to scrape a web page. It requires more time to learn and understand how Scrapy works, but once learned, it eases the process of making web crawlers and running them from a single command. Becoming an expert in Scrapy might take some practice and time to learn all of its functionality.
BeautifulSoup — relatively easy for programming newcomers to understand and can get smaller tasks done in no time.
Speed and Load
Scrapy — can get big jobs done very easily. It can crawl a group of URLs in no more than a minute, depending on the size of the group, and it does so very smoothly because it uses Twisted, which works asynchronously (non-blocking) for concurrency.
BeautifulSoup — used for simple scraping jobs with efficiency; it is slower than Scrapy if you do not use multiprocessing.
Extending Functionality
Scrapy — provides item pipelines that allow you to write functions in your spider to process your data, such as validating it, removing it, or saving it to a database. It provides spider contracts to test your spiders and lets you create generic and deep crawlers as well. It also allows you to manage many variables such as retries, redirection, and so on.
BeautifulSoup — if the project does not require much logic, BeautifulSoup is good for the job, but if you need much customization, such as proxies, managing cookies, and data pipelines, Scrapy is the better option.
Information: Synchronous means that you have to wait for one job to finish before starting a new one, while asynchronous means you can move on to another job before the previous one has finished.
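As a toy illustration of that difference (Scrapy itself uses Twisted rather than asyncio, so this is only about the idea of not blocking), compare how five simulated one-second downloads overlap:
import asyncio
import time

async def fake_download(page):
    await asyncio.sleep(1)  # stands in for network wait time
    return f"page {page} done"

async def crawl_async(pages):
    # Asynchronous: all the waits overlap, so five "downloads" take about 1 second.
    return await asyncio.gather(*(fake_download(p) for p in pages))

start = time.perf_counter()
print(asyncio.run(crawl_async(range(5))))
print(f"elapsed: {time.perf_counter() - start:.1f}s")  # roughly 1.0s instead of 5.0s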
Here is an interesting DataCamp BeautifulSoup tutorial to learn.
With Python 3.0 (or later) installed, if you are using Anaconda, you can use conda to install Scrapy. Write the following command in the Anaconda prompt:
conda install -c conda-forge scrapy
To install anaconda, look at these DataCamp tutorials for Mac and Windows.
Alternatively, you can use Python Package Installer pip. This works for Linux, Mac, and Windows:
pip install scrapy
Scrapy also provides a web-crawling shell, called the Scrapy Shell, that developers can use to test their assumptions about a site’s behavior. Let us take a web page for tablets on the AliExpress e-commerce website. You can use the Scrapy Shell to see what components the web page returns and how you can use them for your requirements.
Open your command line and write the following command:
scrapy shell
If you are using anaconda, you can write the above command at the anaconda prompt as well. Your output on the command line or anaconda prompt will be something like this:
You have to run a crawler on the web page using the fetch command in the Scrapy shell. A crawler or spider goes through a webpage downloading its text and metadata.
fetch()
Note: Always enclose the URL in quotes; both single and double quotes work.
The output will be as follows:
The crawler returns a response which can be viewed by using the view(response) command on shell:
view(response)
And the web page will be opened in the default browser.
You can view the raw HTML script by using the following command in Scrapy shell:
print(response.text)
You will see the script that’s generating the webpage. It is the same content you see when you right-click any blank area on a webpage and click View Source or View Page Source. Since you need only the relevant information from the entire script, you will use the browser developer tools to inspect the required element. Let us take the following elements:
Tablet name
Tablet price
Number of orders
Name of store
Right-click on the element you want and click inspect like below:
The developer tools of the browser will help you a lot with web scraping. You can see that the element is a tag with the class product, and the text it contains is the name of the product:
You can extract this using the element’s attributes or a CSS selector such as its class. Write the following in the Scrapy shell to extract the product name:
response.css(".product::text").extract_first()
The output will be:
extract_first() extracts the first element that satisfies the CSS selector. If you want to extract all the product names, use extract():
response.css(".product::text").extract()
The following code will extract the price range of the products:
response.css("").extract()
Similarly, you can try extracting the number of orders and the name of the store; a shell example using XPath selectors for these appears at the end of the XPath discussion below.
XPath is a query language for selecting nodes in an XML document [7]. You can navigate through an XML document using XPath. Behind the scenes, Scrapy uses XPath to navigate to HTML document items. The CSS selectors you used above are also converted to XPath, but in many cases CSS is easier to use. Still, you should know how XPath in Scrapy works.
Go to your Scrapy Shell and write fetch() the same way as before. Try out the following code snippets [3]:
response.xpath('/html').extract()
This will show you all the code under the <html> tag. / means a direct child of the node. If you want to get the <div> tags under the html tag, you will write [3]:
response.xpath('/html//div').extract()
For XPath, you must learn to understand the use of / and // to know how to navigate through child and descendent nodes. Here is a helpful tutorial for XPath Nodes and some examples to try out.
If you want to get all the <div> tags, you can do it by drilling down without using /html [3]:
response.xpath("//div").extract()
You can further filter the nodes you start from to reach your desired nodes by using attributes and their values. Below is the syntax for using classes and their values:
response.xpath("//div[@class='quote']/span[@class='text']").extract()
response.xpath("//div[@class='quote']/span[@class='text']/text()").extract()
Use text() to extract all the text inside nodes.
Consider the following HTML code:
You want to get the text inside the <a> tag, which is a child node of the <div> having the classes site-notice-container container. You can do it as follows:
response.xpath('//div[@class="site-notice-container container"]/a[@class="notice-close"]/text()').extract()
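Putting this together with the earlier shell session, you could try the number of orders and the store name mentioned above, using the XPath selectors that appear later in this tutorial's spider code (the exact title and class values depend on AliExpress's markup at the time):
# Run these in the Scrapy shell after fetch(), where `response` is already defined.
orders = response.xpath("//em[@title='Total Orders']/text()").extract()
stores = response.xpath("//a[@class='store $p4pLog']/text()").extract()
print(list(zip(orders, stores)))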
Creating a Scrapy project and Custom Spider
Web scraping can be used to make an aggregator that you can use to compare data. For example, suppose you want to buy a tablet and you want to compare products and prices; you can crawl your desired pages and store the results in an Excel file. Here you will be scraping aliexpress.com for tablet information.
Now, you will create a custom spider for the same page. First, you need to create a Scrapy project in which your code and results will be stored. Write the following command in the command line or anaconda prompt.
scrapy startproject aliexpress
This will create a project folder in the directory where you ran the command; aliexpress will be the name of the folder, but you can give it any name. You can view the folder contents directly through your file explorer. The following is the structure of the folder:
file/folder — Purpose
scrapy.cfg — deploy configuration file
aliexpress/ — the project’s Python module; you’ll import your code from here
__init__.py — initialization file
items.py — project items file
pipelines.py — project pipelines file
settings.py — project settings file
spiders/ — a directory where you’ll later put your spiders
Once you have created the project, change to the newly created directory and write the following command:
scrapy genspider aliexpress_tablets
This creates a template file named aliexpress_tablets.py in the spiders directory, as discussed above. The code in that file is as below:
import scrapy


class AliexpressTabletsSpider(scrapy.Spider):
    name = 'aliexpress_tablets'
    allowed_domains = ['aliexpress.com']
    start_urls = ['']

    def parse(self, response):
        pass
In the above code you can see name, allowed_domains, start_urls and a parse function.
name: Name is the name of the spider. Proper names will help you keep track of all the spiders you make. Names must be unique, as the name is used to run the spider when scrapy crawl name_of_spider is issued.
allowed_domains (optional): An optional Python list that contains the domains allowed to be crawled. Requests for URLs not in this list will not be crawled. This should include only the domain of the website (for example, aliexpress.com) and not the entire URL specified in start_urls; otherwise you will get warnings.
start_urls: This requests the URLs mentioned. It is a list of URLs where the spider will begin to crawl from, when no particular URLs are specified [4]. So, the first pages downloaded will be those listed here. Subsequent requests will be generated successively from data contained in the start URLs [4].
parse(self, response): This function will be called whenever a URL is crawled successfully. It is also called the callback function. The response (used in the Scrapy Shell) returned as a result of crawling is passed to this function, and you write the extraction code inside it!
Information: You can use BeautifulSoup inside the parse() function of the Scrapy spider to parse the HTML document.
Note: You can extract data through CSS selectors using response.css(), as discussed in the Scrapy Shell section, but also using XPath (XML), which allows you to access child elements. You will see an example of response.xpath() in the code edited in the parse() function.
You will make changes to the aliexpress_tablets.py file. I have added another URL in start_urls. You can add the extraction logic to the parse() function as below:
# -*- coding: utf-8 -*-
import scrapy


class AliexpressTabletsSpider(scrapy.Spider):
    name = 'aliexpress_tablets'
    allowed_domains = ['aliexpress.com']
    start_urls = ['',
                  '']

    def parse(self, response):
        print("procesing:" + response.url)
        # Extract data using css selectors
        product_name = response.css('.product::text').extract()
        price_range = response.css('').extract()
        # Extract data using xpath
        orders = response.xpath("//em[@title='Total Orders']/text()").extract()
        company_name = response.xpath("//a[@class='store $p4pLog']/text()").extract()

        row_data = zip(product_name, price_range, orders, company_name)

        # Making extracted data row wise
        for item in row_data:
            # create a dictionary to store the scraped info
            scraped_info = {
                # key:value
                'page': response.url,
                'product_name': item[0],  # item[0] means product name in the list, and so on; the index tells what value to assign
                'price_range': item[1],
                'orders': item[2],
                'company_name': item[3],
            }

            # yield or give the scraped info to scrapy
            yield scraped_info
Information: zip() takes any number of iterables and returns an iterator of tuples (a list in Python 2). The ith element of each tuple is created using the ith element from each of the iterables. [8]
The yield keyword is used whenever you are defining a generator function. A generator function is just like a normal function except that it uses the yield keyword instead of return. Whenever the calling function needs a value, the function containing yield retains its local state and continues executing where it left off after yielding a value to the caller. Here yield gives the generated dictionary to Scrapy, which will process and save it!
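Here is a toy example, unrelated to scraping, that shows zip() and yield working together; the names and prices are made up:
def paired(names, prices):
    # The generator pauses at each yield and resumes where it left off
    # the next time the caller asks for a value.
    for name, price in zip(names, prices):
        yield {"name": name, "price": price}

for row in paired(["tablet A", "tablet B"], ["$99", "$149"]):
    print(row)
# {'name': 'tablet A', 'price': '$99'}
# {'name': 'tablet B', 'price': '$149'}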
Now you can run the spider:
scrapy crawl aliexpress_tablets
You will see a long output at the command line like below:
Exporting data
You will need the data to be presented as CSV or JSON so that you can use it further for analysis. This section of the tutorial will take you through how you can save a CSV or JSON file for this data.
To save a CSV file, open settings.py from the project directory and add the following lines:
FEED_FORMAT = "csv"
FEED_URI = ""
After saving settings.py, rerun scrapy crawl aliexpress_tablets in your project directory.
The CSV file will look like:
Note: Every time you run the spider, it will append to the file.
FEED_FORMAT [5]: This sets the format in which you want to store the data. Supported formats are:
+ JSON
+ CSV
+ JSON Lines
+ XML
FEED_URI [5]: This gives the location of the file. You can store a file on your local file storage or an FTP as well.
Scrapy’s Feed Export can also add a timestamp and the name of the spider to your file name, or you can use these to identify a directory in which you want to store the feed.
%(time)s: gets replaced by a timestamp when the feed is being created [5]
%(name)s: gets replaced by the spider name [5]
For Example:
Store in FTP using one directory per spider [5]:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
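For local storage, a similar pattern might look like the following in settings.py; the directory and extension here are just examples:
FEED_FORMAT = "csv"
FEED_URI = "feeds/%(name)s/%(time)s.csv"  # e.g. feeds/aliexpress_tablets/<timestamp>.csv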
The Feed changes you make in settings.py will apply to all spiders in the project. You can also set custom settings for a particular spider that will override the settings in the settings.py file.
custom_settings = {
    'FEED_URI': "aliexpress_%(time)s.json",
    'FEED_FORMAT': 'json'
}
response.url returns the URL of the page from which the response was generated. After running the crawler using scrapy crawl aliexpress_tablets, you can view the JSON file:
Following Links
You must have noticed that there are two links in start_urls. The second link is page 2 of the same tablets search results. It would become impractical to add all the links. A crawler should be able to crawl through all the pages by itself; only the starting point should be mentioned in start_urls.
If a page has subsequent pages, you will see a navigator for it at the end of the page that allows moving back and forth between the pages. In the case you have been implementing in this tutorial, you will see it like this:
Here is the code that you will see:
As you can see, under the pagination container there is a tag whose class marks the current page you are on, and after it come the tags with links to the next pages. Every time, you will have to get the tags that follow this one. Here comes a little bit of CSS! Here you have to get a sibling node, not a child node, so you have to write a CSS selector that tells the crawler to find the tag that comes after the current-page tag.
Remember! Each web page has its own structure. You will have to study the structure a little to see how you can get the desired element. Always try out response.css(SELECTOR) in the Scrapy Shell before writing it in code.
Modify your aliexpress_tablets.py as below:
custom_settings = {
    'FEED_URI': "aliexpress_%(time)s.csv",
    'FEED_FORMAT': 'csv'
}
…
NEXT_PAGE_SELECTOR = ' + a::attr(href)'
next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
if next_page:
    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse)
In the above code:
You first extracted the link of the next page using next_page = response.css(NEXT_PAGE_SELECTOR).extract_first(), and then, if the variable next_page gets a link and is not empty, execution enters the if body.
response.urljoin(next_page): The parse() method uses this method to build a new URL and provide a new request, which will be sent later to the callback. [9]
After receiving the new URL, it will scrape that link, executing the parse body again and looking for the next page. This will continue until it doesn’t get a next page link.
Here you might want to sit back and enjoy your spider scraping all the pages. The above spider will extract from all subsequent pages. That will be a lot of scraping! But your spider will do it! Below you can see that the size of the file has reached 1.1 MB.
Scrapy does it for you!
In this tutorial, you have learned about Scrapy, how it compares to BeautifulSoup, the Scrapy Shell, and how to write your own spiders in Scrapy. Scrapy handles all of the heavy lifting for you, from creating project files and folders to handling duplicate URLs; it helps you do heavy-duty web scraping in minutes and provides support for all common data formats that you can feed into other programs. This tutorial has surely helped you understand Scrapy, its framework, and what you can do with it. To become a master of Scrapy, you will need to go through all of the fantastic functionality it provides, but this tutorial has made you capable of scraping groups of web pages in an efficient way.
For further reading, you can refer to the official Scrapy docs.
Also, don’t forget to check out DataCamp’s Web Scraping with Python course.
References
[1] [2] [3] [4] [5] [6] [7] [8] [9]
Frequently Asked Questions about Scrapy: Python Web Scraping & Crawling for Beginners
Is Scrapy good for web scraping?
It is available for Python 2.6+ and Python 3. Scrapy is a powerhouse for web scraping and offers a lot of ways to scrape a web page. It requires more time to learn and understand how Scrapy works but once learned, eases the process of making web crawlers and running them from just one line of command.Jan 11, 2019
Is web scraping the same as web crawling?
Crawling is essentially what search engines do. … The web crawling process usually captures generic information, whereas web scraping hones in on specific data set snippets. Web scraping, also known as web data extraction, is similar to web crawling in that it identifies and locates the target data from web pages.Nov 30, 2020
How do you crawl data from a website in Python?
To extract data using web scraping with Python, you need to follow these basic steps:
Find the URL that you want to scrape.
Inspect the page.
Find the data you want to extract.
Write the code.
Run the code and extract the data.
Store the data in the required format.
Sep 24, 2021