…
How To Crawl A Website Using Python
Crawling and Scraping Web Pages with Scrapy and Python 3
Introduction
Web scraping, often called web crawling or web spidering, or “programmatically going over a collection of web pages and extracting data,” is a powerful tool for working with data on the web.
With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own personal curiosity.
In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use BrickSet, a community-run site that contains information about LEGO sets. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages on Brickset and extracts data about LEGO sets from each page, displaying the data to your screen.
The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.
Prerequisites
To complete this tutorial, you’ll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.
Step 1 — Creating a Basic Scraper
Scraping is a two-step process:
You systematically find and download web pages.
You take those web pages and extract information from them.
Both of those steps can be implemented in a number of ways in many languages.
You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.
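To make the two steps concrete, here is a minimal from-scratch sketch that uses only the Python standard library. The names download, extract_title, and TitleParser are illustrative and not part of the tutorial's project; the sample HTML is made up. It handles none of the concurrency or export concerns mentioned above, which is the gap a library fills.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def download(url):
    """Step 1: fetch a page and return its HTML as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Step 2: pull one piece of information out of the HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title is not None else None


# Step 2 on an in-memory sample, so no network access is needed here.
sample = "<html><head><title> LEGO sets </title></head><body></body></html>"
print(extract_title(sample))  # LEGO sets
```

Calling extract_title(download(url)) chains both steps for a real page.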
You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.
Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time. It makes scraping a quick and fun process!
Scrapy, like most Python packages, is on PyPI, the Python Package Index, a community-owned repository of published Python software. Packages on PyPI are installed with pip, Python's package installer.
If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command:
pip install scrapy
If you run into any issues with the installation, or you want to install Scrapy without using pip, check out the official installation docs.
With Scrapy installed, let’s create a new folder for our project. You can do this in the terminal by running:
mkdir brickset-scraper
Now, navigate into the new directory you just created:
cd brickset-scraper
Then create a new Python file for our scraper. We’ll place all of our code in this file for this tutorial. You can create this file in the terminal with the touch command, like this:
touch
Or you can create the file using your text editor or graphical file manager.
We’ll start by making a very basic scraper that uses Scrapy as its foundation. To do that, we’ll create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes:
name — just a name for the spider.
start_urls — a list of URLs that you start to crawl from. We’ll start with one URL.
Open the file in your text editor and add this code to create the basic spider:
import scrapy
class BrickSetSpider(scrapy.Spider):
    name = 'brickset_spider'
    start_urls = ['']
Let’s break this down line by line:
First, we import scrapy so that we can use the classes that the package provides.
Next, we take the Spider class provided by Scrapy and make a subclass out of it called BrickSetSpider. Think of a subclass as a more specialized form of its parent class. The Spider subclass has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn’t know where to look or what data to look for. By subclassing it, we can give it that information.
Then we give the spider the name brickset_spider.
Finally, we give our scraper a single URL to start from. If you open that URL in your browser, it will take you to a search results page, showing the first of many pages containing LEGO sets.
Now let’s test out the scraper. You typically run Python files by running the python command followed by the path to the file. However, Scrapy comes with its own command line interface to streamline the process of starting a scraper. Start your scraper with the following command:
scrapy runspider
You’ll see something like this:
Output
2016-09-22 23:37:45 [scrapy] INFO: Scrapy 1.1.2 started (bot: scrapybot)
2016-09-22 23:37:45 [scrapy] INFO: Overridden settings: {}
2016-09-22 23:37:45 [scrapy] INFO: Enabled extensions:
[‘scrapy.extensions.logstats.LogStats’,
”,
‘reStats’]
2016-09-22 23:37:45 [scrapy] INFO: Enabled downloader middlewares:
[‘tpAuthMiddleware’,…
”]
2016-09-22 23:37:45 [scrapy] INFO: Enabled spider middlewares:
[‘tpErrorMiddleware’,…
2016-09-22 23:37:45 [scrapy] INFO: Enabled item pipelines:
[]
2016-09-22 23:37:45 [scrapy] INFO: Spider opened
2016-09-22 23:37:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-22 23:37:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-22 23:37:47 [scrapy] DEBUG: Crawled (200)
2016-09-22 23:37:47 [scrapy] INFO: Closing spider (finished)
2016-09-22 23:37:47 [scrapy] INFO: Dumping Scrapy stats:
{‘downloader/request_bytes’: 224,
‘downloader/request_count’: 1,…
‘scheduler/enqueued/memory’: 1,
‘start_time’: time(2016, 9, 23, 6, 37, 45, 995167)}
2016-09-22 23:37:47 [scrapy] INFO: Spider closed (finished)
That’s a lot of output, so let’s break it down.
The scraper initialized and loaded additional components and extensions it needed to handle reading data from URLs.
It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would do.
It passed that HTML to the parse method, which doesn’t do anything by default. Since we never wrote our own parse method, the spider just finishes without doing any work.
Now let’s pull some data from the page.
Step 2 — Extracting Data from a Page
We’ve created a very basic program that pulls down a page, but it doesn’t do any scraping or spidering yet. Let’s give it some data to extract.
If you look at the page we want to scrape, you’ll see it has the following structure:
There’s a header that’s present on every page.
There’s some top-level search data, including the number of matches, what we’re searching for, and the breadcrumbs for the site.
Then there are the sets themselves, displayed in what looks like a table or ordered list. Each set has a similar format.
When writing a scraper, it’s a good idea to look at the source of the HTML file and familiarize yourself with the structure. So here it is, with some things removed for readability:
…
Scraping this page is a two-step process:
First, grab each LEGO set by looking for the parts of the page that have the data we want.
Then, for each set, grab the data we want from it by pulling the data out of the HTML tags.
Scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within the element. Scrapy supports either CSS selectors or XPath selectors.
We’ll use CSS selectors for now since CSS is the easier option and a perfect fit for finding all the sets on the page. If you look at the HTML for the page, you’ll see that each set is specified with the class set. Since we’re looking for a class, we’d use .set for our CSS selector. All we have to do is pass that selector into the response object, like this:
    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            pass
This code grabs all the sets on the page and loops over them to extract the data. Now let’s extract the data from those sets so we can display it.
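To see what selecting by class amounts to, the sketch below does the same job by hand with the standard library's html.parser: it finds every element whose class list contains set and collects the h1 text inside it. The sample markup and the names SetNameParser and set_names are assumptions for illustration; this is not how Scrapy implements CSS selectors.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SetNameParser(HTMLParser):
    """Collect the <h1> text found inside each element with class "set"."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._depth = 0      # nesting depth inside the current .set element
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
            if tag == "h1":
                self._in_h1 = True
                self.names.append("")
        elif "set" in classes:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        if tag == "h1":
            self._in_h1 = False
        self._depth -= 1

    def handle_data(self, data):
        if self._in_h1:
            self.names[-1] += data


def set_names(html):
    parser = SetNameParser()
    parser.feed(html)
    return [name.strip() for name in parser.names]


# Hypothetical, simplified markup; the real page is more complex.
SAMPLE = """
<article class="set"><img src="a.jpg"><h1>Brick Bank</h1></article>
<article class="other"><h1>Not a set</h1></article>
<article class="set featured"><h1>Big Ben</h1></article>
"""
print(set_names(SAMPLE))  # ['Brick Bank', 'Big Ben']
```

The amount of bookkeeping needed here is a fair argument for letting a selector engine do it.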
Another look at the source of the page we’re parsing tells us that the name of each set is stored within an h1 tag for each set:
Brick Bank
10251-1
The brickset object we’re looping over has its own css method, so we can pass in a selector to locate child elements. Modify your code as follows to locate the name of the set and display it:
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }
Note: The trailing comma after extract_first() isn’t a typo. We’re going to add more to this section soon, so we’ve left the comma there to make adding to this section easier later.
You’ll notice two things going on in this code:
We append ::text to our selector for the name. That’s a CSS pseudo-selector that fetches the text inside of the h1 tag rather than the tag itself.
We call extract_first() on the object returned by brickset.css(NAME_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.
Save the file and run the scraper again:
This time you’ll see the names of the sets appear in the output:
Output
...
[scrapy] DEBUG: Scraped from <200 >
{'name': 'Brick Bank'}
{'name': 'Volkswagen Beetle'}
{'name': 'Big Ben'}
{'name': 'Winter Holiday Train'}
...
Let’s keep expanding on this by adding new selectors for images, pieces, and miniature figures, or minifigs, that come with a set.
Take another look at the HTML for a specific set:
…
10251: Brick Bank
…
We can see a few things by examining this code:
The image for the set is stored in the src attribute of an img tag inside an a tag at the start of the set. We can use another CSS selector to fetch this value just like we did when we grabbed the name of each set.
Getting the number of pieces is a little trickier. There’s a dt tag that contains the text Pieces, and then a dd tag that follows it which contains the actual number of pieces. We’ll use XPath, a query language for traversing XML, to grab this, because it’s too complex to be represented using CSS selectors.
Getting the number of minifigs in a set is similar to getting the number of pieces. There’s a dt tag that contains the text Minifigs, followed by a dd tag right after that with the number.
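The dt/dd lookup described above can be tried offline with the standard library's xml.etree.ElementTree, which supports a small subset of XPath. This is only a sketch under assumptions: the markup below is a simplified, well-formed stand-in for one set (the image path is made up), ElementTree has no text() step so the paths stop at the element and .text is read instead, and real HTML is often not well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one set; the real page has more fields.
SET_HTML = """
<article class="set">
  <a href="#"><img src="/images/10251-1.jpg" /></a>
  <h1>Brick Bank</h1>
  <dl>
    <dt>Pieces</dt><dd><a href="#">2380</a></dd>
    <dt>Minifigs</dt><dd><a href="#">5</a></dd>
  </dl>
</article>
"""


def first_text(node, path):
    """Return the text of the first match, or None when nothing matches."""
    found = node.find(path)
    return found.text if found is not None else None


def first_attr(node, path, attr):
    """Return an attribute of the first match, or None when nothing matches."""
    found = node.find(path)
    return found.get(attr) if found is not None else None


def parse_set(markup):
    node = ET.fromstring(markup)
    return {
        "name": first_text(node, ".//h1"),
        # A dl that has a dt child with this exact text, then its dd/a.
        "pieces": first_text(node, './/dl[dt="Pieces"]/dd/a'),
        # Same idea, but take the second dd, as the tutorial's selector does.
        "minifigs": first_text(node, './/dl[dt="Minifigs"]/dd[2]/a'),
        "image": first_attr(node, ".//img", "src"),
    }


print(parse_set(SET_HTML))
```

A set without a Minifigs entry yields None for that key, which matches the None values that appear in the scraper's output later in this step.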
So, let’s modify the scraper to get this new information:
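class BrickSetSpider(scrapy.Spider):
    name = 'brickset_spider'
    start_urls = ['']
    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            IMAGE_SELECTOR = 'img::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }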
Save your changes and run the scraper again:
Now you’ll see that new data in the program’s output:
Output
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 >
{‘minifigs’: ‘5’, ‘pieces’: ‘2380’, ‘name’: ‘Brick Bank’, ‘image’: ”}
2016-09-22 23:52:37 [scrapy] DEBUG: Scraped from <200 >
{‘minifigs’: None, ‘pieces’: ‘1167’, ‘name’: ‘Volkswagen Beetle’, ‘image’: ”}
{‘minifigs’: None, ‘pieces’: ‘4163’, ‘name’: ‘Big Ben’, ‘image’: ”}
{‘minifigs’: None, ‘pieces’: None, ‘name’: ‘Winter Holiday Train’, ‘image’: ”}
{‘minifigs’: None, ‘pieces’: None, ‘name’: ‘XL Creative Brick Box’, ‘image’: ‘/assets/images/misc/’}
{‘minifigs’: None, ‘pieces’: ‘583’, ‘name’: ‘Creative Building Set’, ‘image’: ”}
Now let’s turn this scraper into a spider that follows links.
Step 3 — Crawling Multiple Pages
We’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.
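The idea of following a "next page" link can be sketched without any network access. In the example below, PAGES is a hypothetical in-memory stand-in for a site, and fetch is an assumed callable that returns the items on a page plus the href of its next link; neither comes from Scrapy or Brickset. The part that carries over to real crawling is resolving a relative link against the current URL with urljoin and stopping when there is no next link.

```python
from urllib.parse import urljoin

# Hypothetical site: each URL maps to (items on the page, relative "next" link or None).
PAGES = {
    "https://example.com/sets/page-1": (["Brick Bank", "Big Ben"], "page-2"),
    "https://example.com/sets/page-2": (["Volkswagen Beetle"], "page-3"),
    "https://example.com/sets/page-3": (["Winter Holiday Train"], None),
}


def crawl(start_url, fetch, max_pages=50):
    """Follow "next" links from start_url, yielding the items found on each page."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, next_href = fetch(url)
        yield from items
        # Resolve the (possibly relative) link against the current page URL.
        url = urljoin(url, next_href) if next_href else None


start = "https://example.com/sets/page-1"
print(list(crawl(start, PAGES.__getitem__)))
# ['Brick Bank', 'Big Ben', 'Volkswagen Beetle', 'Winter Holiday Train']
```

The seen set and max_pages guard against link loops and runaway crawls.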
You’ll notice that the top and bottom of each page have a little right caret (>) that links to the next page of results. Here’s the HTML for that:
Web Crawler in Python – TopCoder
With the advent of the era of big data, the need for network information has increased widely. Many different companies collect external data from the Internet for various reasons: analyzing competition, summarizing news stories, tracking trends in specific markets, or collecting daily stock prices to build predictive models. Therefore, web crawlers are becoming more important. Web crawlers automatically browse or grab information from the Internet according to specified rules.
Classification of web crawlers
According to the implemented technology and structure, web crawlers can be divided into general web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers.
Basic workflow of web crawlers
Basic workflow of general web crawlers
The basic workflow of a general web crawler is as follows:
Get the initial URL. The initial URL is an entry point for the web crawler, which links to the web page that needs to be crawled.
While crawling the web page, fetch the HTML content of the page, then parse it to get the URLs of all the pages linked to this page.
Put these URLs into a queue.
Loop through the queue, reading the URLs one by one. For each URL, crawl the corresponding web page, then repeat the above crawling process.
Check whether the stop condition is met. If the stop condition is not set, the crawler will keep crawling until it cannot get a new URL.
Environmental preparation for web crawling
Make sure that a browser such as Chrome, IE or other has been installed on the computer.
Download and install Python.
Download a suitable IDE. This article uses Visual Studio Code.
Install the required Python packages
Pip is a Python package management tool. It provides functions for searching, downloading, installing, and uninstalling Python packages. This tool is included when downloading and installing Python. Therefore, we can directly use ‘pip install’ to install the libraries we need.
pip install beautifulsoup4
pip install requests
pip install lxml
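Before moving on to the libraries, the basic workflow of a general web crawler described above (initial URL, queue, loop, stop condition) can be sketched with the standard library alone. LINKS is a hypothetical link graph standing in for real pages, and get_links is an assumed callable that returns the URLs found on a page; in a real crawler it would download and parse the page.

```python
from collections import deque

# Hypothetical link graph: each page maps to the URLs it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl starting at seed; returns pages in visit order."""
    queue = deque([seed])   # URLs waiting to be crawled
    seen = {seed}           # URLs already queued, to avoid revisiting
    visited = []
    # Stop condition: the queue is empty or the page budget is used up.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/", lambda url: LINKS.get(url, [])))
# ['/', '/a', '/b', '/c']
```

Without the max_pages budget, the loop ends only when no new URL can be found, which matches the behaviour described when no stop condition is set.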
• BeautifulSoup is a library for easily parsing HTML and XML data.
• lxml is a library to improve the parsing speed of XML files.
• requests is a library to send HTTP requests (such as GET and POST). We will mainly use it to access the source code of any given website.
The following is an example of using a crawler to crawl the top 100 movie names and movie introductions on Rotten Tomatoes.
Top 100 movies of all time – Rotten Tomatoes
We need to extract the name of the movie on this page and its ranking, and go deep into each movie link to get the movie’s introduction.
1. First, you need to import the libraries you need to use.
import requests
import lxml
from bs4 import BeautifulSoup
2. Create and access URL
Create a URL address that needs to be crawled, then create the header information, and then send a network request to wait for a response.
url = ''
f = requests.get(url)
When requesting the content of a webpage, you will sometimes get a 403 error. This means the server has rejected your access, typically because of an anti-crawler setting the site uses to prevent malicious collection of information. In that case, you can access the page by sending browser-like header information.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'}
f = requests.get(url, headers=headers)
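The same header technique can be tried without installing requests. The sketch below swaps in the standard library's urllib.request and only builds the request object, so nothing is sent over the network. The User-Agent string and the build_request helper are assumptions for illustration.

```python
from urllib.request import Request

# Hypothetical User-Agent string; substitute whatever identifies your client.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)"}


def build_request(url, headers=HEADERS):
    """Return a GET request object carrying the given headers (nothing is sent yet)."""
    return Request(url, headers=dict(headers))


request = build_request("https://example.com/")
print(request.get_method())              # GET
# urllib stores header names with only the first letter capitalized.
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would actually send it.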
3. Parse webpage
Create a BeautifulSoup object and specify the parser as lxml.
soup = BeautifulSoup(f.content, 'lxml')
4. Extract information
The BeautifulSoup library has three methods to find elements:
find_all(): find all nodes
find(): find a single node
select(): find nodes according to a CSS selector
We need to get the name and link of the top 100 movies. We noticed that the name of the movie needed is under the table element with the class table. After extracting the page content using BeautifulSoup, we can use the find method to extract the relevant information.
movies = soup.find('table', {'class': 'table'}).find_all('a')
Get an introduction to each movie
After extracting the relevant information, you also need to extract the introduction of each movie. The introduction of the movie is in the link of each movie, so you need to follow the link of each movie to get the introduction.
The code is:
movies_lst = []
soup = BeautifulSoup(f.content, 'lxml')
movies = soup.find('table', {
    'class': 'table'}).find_all('a')
num = 0
for anchor in movies:
    # Site base URL plus the relative link of the movie
    urls = '' + anchor['href']
    movies_lst.append(urls)
    num += 1
    movie_url = urls
    # Fetch and parse the page of this movie
    movie_f = requests.get(movie_url, headers=headers)
    movie_soup = BeautifulSoup(movie_f.content, 'lxml')
    movie_content = movie_soup.find('div', {
        'class': 'movie_synopsis clamp clamp-6 js-clamp'})
    print(num, urls, '\n', 'Movie:' + anchor.string.strip())
    print('Movie info:' + movie_content.string.strip())
The output is:
Write the crawled data to Excel
In order to facilitate data analysis, the crawled data can be written into Excel. We use xlwt to write data into Excel.
Import the xlwt library.
from xlwt import *
Create an empty table.
workbook = Workbook(encoding='utf-8')
table = workbook.add_sheet('data')
Create the header of each column in the first row.
table.write(0, 0, 'Number')
table.write(0, 1, 'movie_url')
table.write(0, 2, 'movie_name')
table.write(0, 3, 'movie_introduction')
Write the crawled data into Excel separately from the second row.
table.write(line, 0, num)
table.write(line, 1, urls)
table.write(line, 2, anchor.string.strip())
table.write(line, 3, movie_content.string.strip())
line += 1
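If xlwt is not available, the same layout (a header row followed by one row per movie) can be written with Python's built-in csv module, and the resulting file opens in Excel. Note that this swaps CSV in for the .xls format used in the article. The rows_to_csv helper and the sample row are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Write a header row, then one numbered row per (url, name, introduction)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Number", "movie_url", "movie_name", "movie_introduction"])
    for number, (url, name, intro) in enumerate(rows, start=1):
        writer.writerow([number, url, name, intro])
    return buffer.getvalue()


# Hypothetical row standing in for crawled data.
sample_rows = [
    ("https://example.com/m/1", "Example Movie", "A short, made-up synopsis."),
]
print(rows_to_csv(sample_rows))
```

Fields that contain commas are quoted automatically by the csv writer, so introductions with punctuation stay in one column. To write to disk instead, open a file with newline="" and pass it to csv.writer.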
Finally, save the workbook.
workbook.save('')
The final code is:
from xlwt import *
line = 1
workbook.save('')
The result is:
How to Crawl a Website with DeepCrawl
Running frequent and targeted crawls of your website is a key part of improving its technical health and improving rankings in organic search. In this guide, you’ll learn how to crawl a website efficiently and effectively with DeepCrawl. The six steps to crawling a website include:
Configuring the URL sources
Understanding the domain structure
Running a test crawl
Adding crawl restrictions
Testing your changes
Running your crawl
Step 1: Configuring the URL sources
There are six types of URL sources you can include in your DeepCrawl projects.
Including each one strategically is the key to an efficient and comprehensive crawl:
Web crawl: Crawl only the site by following its links to deeper levels.
Sitemaps: Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled.
Analytics: Upload analytics source data, and crawl the URLs, to discover additional landing pages on your site which may not be linked. The analytics data will be available in various reports.
Backlinks: Upload backlink source data, and crawl the URLs, to discover additional URLs with backlinks on your site. The backlink data will be available in various reports.
URL lists: Crawl a fixed list of URLs. Links on these pages will not be followed or crawled.
Log files: Upload log file summary data from log file analyser tools, such as Splunk and
Ideally, a website should be crawled in full (including every linked URL on the site). However, very large websites, or sites with many architectural problems, may not be able to be fully crawled immediately. It may be necessary to restrict the crawl to certain sections of the site, or limit specific URL patterns (we’ll cover how to do this below).
Step 2: Understanding the Domain Structure
Before starting a crawl, it’s a good idea to get a better understanding of your site’s domain structure:
Check the www/non-www and http/https configuration of the domain when you add the domain.
Identify whether the site is using sub-domains.
If you are not sure about sub-domains, check the DeepCrawl “Crawl Subdomains” option and they will automatically be discovered if they are linked.
Step 3: Running a Test Crawl
Start with a small “Web Crawl” to look for signs that the site is uncrawlable.
Before starting the crawl, ensure that you have set the “Crawl Limit” to a low quantity. This will make your first checks more efficient, as you won’t have to wait very long to see the results.
Problems to watch for include:
A high number of URLs returning error codes, such as 401 access denied
URLs returned that are not of the correct subdomain – check that the base domain is correct under “Project Settings”.
Very low number of URLs found.
A large number of failed URLs (502, 504, etc).
A large number of canonicalized URLs.
A large number of duplicate pages.
A significant increase in the number of pages found at each level.
To save time, and check for obvious problems immediately, download the URLs during the crawl.
Step 4: Adding Crawl Restrictions
Next, reduce the size of the crawl by identifying anything that can be excluded. Adding restrictions ensures you are not wasting time (or credits) crawling URLs that are not important to you. All the following restrictions can be added within the “Advanced Settings” tab.
Remove Parameters
If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in the “Remove Parameters” field under “Advanced Settings.”
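To preview what removing parameters does to a set of URLs before configuring a crawl, a short Python sketch with the standard library can help. The remove_params helper, the example URL, and the sessionid parameter are assumptions for illustration; this is not DeepCrawl's implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_params(url, params_to_remove):
    """Return url with the named query parameters stripped out."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in params_to_remove
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(remove_params(
    "https://example.com/shoes?color=red&sessionid=abc&page=2",
    {"sessionid"},
))
# https://example.com/shoes?color=red&page=2
```

URLs that differ only in a removed parameter collapse to the same address, which is why this setting reduces the size of a crawl.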
Add Custom Settings
DeepCrawl’s “Robots Overwrite” feature allows you to identify additional URLs that can be excluded using a custom robots.txt file – allowing you to test the impact of pushing a new robots.txt file to a live environment.
Upload the alternative version of your robots file under “Advanced Settings” and select “Use Robots Override” when starting the crawl.
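A draft robots file can also be checked locally before uploading it. The sketch below uses Python's standard urllib.robotparser to test which URLs a set of rules would block; the rules and URLs are hypothetical, and the parser's matching may differ in details from how a given crawler interprets the same file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules to evaluate before pushing them live.
DRAFT_ROBOTS = """\
User-agent: *
Disallow: /search
Disallow: /checkout/
"""


def allowed_urls(robots_text, urls, user_agent="*"):
    """Return the subset of urls that the given robots rules allow."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/products",
    "https://example.com/search?q=lego",
    "https://example.com/checkout/cart",
]
print(allowed_urls(DRAFT_ROBOTS, urls))
# ['https://example.com/products']
```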
Filter URLs and URL Paths
Use the “Included/Excluded” URL fields under “Advanced Settings” to limit the crawl to specific areas of interest.
Add Crawl Limits for Groups of Pages
Use the “Page Grouping” feature, under “Advanced Settings,” to restrict the number of URLs crawled for groups of pages based on their URL patterns.
Here, you can add a name.
In the “Page URL Match” column you can add a regular expression.
Add a maximum number of URLs to crawl in the “Crawl Limit” column.
URLs matching the designated path are counted. When the limits have been reached, all further matching URLs go into the “Page Group Restrictions” report and are not crawled.
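The counting behaviour described above can be illustrated with a small sketch: each group has a name, a regular expression, and a limit, and URLs beyond the limit are set aside rather than crawled. The function name, group definitions, and URLs are hypothetical, and the sketch is a model of the concept, not DeepCrawl's implementation.

```python
import re


def apply_group_limits(urls, groups):
    """Split urls into (to_crawl, restricted).

    groups is a list of (name, regex pattern, crawl limit) tuples. A URL is
    counted against the first group whose pattern matches it.
    """
    compiled = [(name, re.compile(pattern), limit) for name, pattern, limit in groups]
    counts = {name: 0 for name, _, _ in groups}
    to_crawl, restricted = [], []
    for url in urls:
        for name, regex, limit in compiled:
            if regex.search(url):
                if counts[name] < limit:
                    counts[name] += 1
                    to_crawl.append(url)
                else:
                    restricted.append(url)
                break
        else:
            # URLs outside every group are not limited.
            to_crawl.append(url)
    return to_crawl, restricted


urls = ["/blog/1", "/blog/2", "/blog/3", "/about"]
groups = [("blog", r"^/blog/", 2)]
print(apply_group_limits(urls, groups))
# (['/blog/1', '/blog/2', '/about'], ['/blog/3'])
```

Testing a pattern this way against a sample of URLs is a quick check that the regular expression matches the pages you intend to cap.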
Step 5: Testing Your Changes
Run test “Web Crawls” to ensure your configuration is correct and you’re ready to run a full crawl.
Step 6: Running your Crawl
Ensure you’ve increased the “Crawl Limit” before running a more in-depth crawl.
Consider running a crawl with as many URL sources as possible, to supplement your linked URLs with XML Sitemap and Google Analytics, and other data.
If you have specified a subdomain of www within the “Base Domain” setting, subdomains such as blog or default will not be crawled.
To include subdomains select “Crawl Subdomains” within the “Project Settings” tab.
Set “Scheduling” for your crawls and track your progress.
Handy Tips
Settings for Specific Requirements
If you have a test/sandbox site you can run a “Comparison Crawl” by adding your test site domain and authentication details in “Advanced Settings.”
For more about the Test vs Live feature, check out our guide to Comparing a Test Website to a Live Website.
To crawl an AJAX-style website, with an escaped fragment solution, use the “URL Rewrite” function to modify all linked URLs to the escaped fragment format.
Read more about our testing features – Testing Development Changes Before Putting Them Live.
Changing Crawl Rate
Watch for performance issues caused by the crawler while running a crawl.
If you see connection errors, or multiple 502/503 type errors, you may need to reduce the crawl rate under “Advanced Settings.”
If you have a robust hosting solution, you may be able to crawl the site at a faster rate.
The crawl rate can be increased at times when the site load is reduced, such as 4 a.m.
Head to “Advanced Settings” > “Crawl Rate” > “Add Rate Restriction.”
Analyze Outbound Links
Sites with a large quantity of external links may want to ensure that users are not directed to dead links.
To check this, select “Crawl External Links” under “Project Settings,” which adds an HTTP status code next to external links within your report.
Read more on outbound link audits to learn about analyzing and cleaning up external links.
Change User Agent
See your site through a variety of crawlers’ eyes (Facebook, Bingbot, etc.) by changing the user agent in “Advanced Settings.”
Add a custom user agent to determine how your website responds.
After The Crawl
Reset your “Project Settings” after the crawl, so you can continue to crawl with ‘real-world’ settings applied.
Remember, the more you experiment and crawl, the closer you get to becoming an expert crawler.
Author
Sam Marsden
Sam Marsden is Deepcrawl’s Former SEO & Content Manager. Sam speaks regularly at marketing conferences, like SMX and BrightonSEO, and is a contributor to industry publications such as Search Engine Journal and State of Digital.
Frequently Asked Questions about how to crawl a website using Python
How do you crawl a website in Python?
We need to extract the name of the movie on this page and its ranking, and go deep into each movie link to get the movie’s introduction. First, import the libraries you need to use (requests, lxml, and BeautifulSoup from bs4). Create and access the URL. … Parse the webpage. … Extract information. (Jan 25, 2021)
How do I web crawl a website?
The six steps to crawling a website include: configuring the URL sources, understanding the domain structure, running a test crawl, adding crawl restrictions, testing your changes, and running your crawl.
Is it legal to crawl a website?
If you’re doing web crawling for your own purposes, it is legal as it falls under fair use doctrine. The complications start if you want to use scraped data for others, especially commercial purposes. … As long as you are not crawling at a disruptive rate and the source is public you should be fine. (Jul 17, 2019)