Python Web Scraping Example
A Practical Introduction to Web Scraping in Python
Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
The Internet hosts perhaps the greatest source of information—and misinformation—on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.
In this tutorial, you’ll learn how to:
Parse website data using string methods and regular expressions
Parse website data using an HTML parser
Interact with forms and other website components
Scrape and Parse Text From Websites
Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones you’ll create in this tutorial. Websites do this for two possible reasons:
The site has a good reason to protect its data. For instance, Google Maps doesn’t let you request too many results too quickly.
Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server such that the website stops responding entirely.
Let’s start by grabbing all the HTML code from a single web page. You’ll use a page on Real Python that’s been set up for use with this tutorial.
Your First Web Scraper
One useful package for web scraping that you can find in Python’s standard library is urllib, which contains tools for working with URLs. In particular, the quest module contains a function called urlopen() that can be used to open a URL within a program.
In IDLE’s interactive window, type the following to import urlopen():
>>>>>> from quest import urlopen
The web page that we’ll open is at the following URL:
>>>>>> url = ”
To open the web page, pass url to urlopen():
>>>>>> page = urlopen(url)
urlopen() returns an HTTPResponse object:
>>>>>> page
< object at 0x105fef820>
To extract the HTML from the page, first use the HTTPResponse object’s () method, which returns a sequence of bytes. Then use () to decode the bytes to a string using UTF-8:
>>>>>> html_bytes = ()
>>> html = (“utf-8”)
Now you can print the HTML to see the contents of the web page:
>>>>>> print(html)
Name: Aphrodite
Favorite animal: Dove
Favorite color: Red
Hometown: Mount Olympus
Once you have the HTML as text, you can extract information from it in a couple of different ways.
A Primer on Regular Expressions
Regular expressions—or regexes for short—are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s re module.
To work with regular expressions, the first thing you need to do is import the re module:
Regular expressions use special characters called metacharacters to denote different patterns. For instance, the asterisk character (*) stands for zero or more of whatever comes just before the asterisk.
In the following example, you use findall() to find any text within a string that matches a given regular expression:
>>>>>> ndall(“ab*c”, “ac”)
[‘ac’]
The first argument of ndall() is the regular expression that you want to match, and the second argument is the string to test. In the above example, you search for the pattern “ab*c” in the string “ac”.
The regular expression “ab*c” matches any part of the string that begins with an “a”, ends with a “c”, and has zero or more instances of “b” between the two. ndall() returns a list of all matches. The string “ac” matches this pattern, so it’s returned in the list.
Here’s the same pattern applied to different strings:
>>>>>> ndall(“ab*c”, “abcd”)
[‘abc’]
>>> ndall(“ab*c”, “acc”)
>>> ndall(“ab*c”, “abcac”)
[‘abc’, ‘ac’]
>>> ndall(“ab*c”, “abdc”)
[]
Notice that if no match is found, then findall() returns an empty list.
Pattern matching is case sensitive. If you want to match this pattern regardless of the case, then you can pass a third argument with the value re. IGNORECASE:
>>>>>> ndall(“ab*c”, “ABC”)
>>> ndall(“ab*c”, “ABC”, re. IGNORECASE)
[‘ABC’]
You can use a period (. ) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters “a” and “c” separated by a single character as follows:
>>>>>> ndall(“a. c”, “abc”)
>>> ndall(“a. c”, “abbc”)
>>> ndall(“a. c”, “ac”)
>>> ndall(“a. c”, “acc”)
[‘acc’]
The pattern. * inside a regular expression stands for any character repeated any number of times. For instance, “a. *c” can be used to find every substring that starts with “a” and ends with “c”, regardless of which letter—or letters—are in between:
>>>>>> ndall(“a. *c”, “abc”)
>>> ndall(“a. *c”, “abbc”)
[‘abbc’]
>>> ndall(“a. *c”, “ac”)
>>> ndall(“a. *c”, “acc”)
Often, you use () to search for a particular pattern inside a string. This function is somewhat more complicated than ndall() because it returns an object called a MatchObject that stores different groups of data. This is because there might be matches inside other matches, and () returns every possible result.
The details of the MatchObject are irrelevant here. For now, just know that calling () on a MatchObject will return the first and most inclusive result, which in most cases is just what you want:
>>>>>> match_results = (“ab*c”, “ABC”, re. IGNORECASE)
>>> ()
‘ABC’
There’s one more function in the re module that’s useful for parsing out text. (), which is short for substitute, allows you to replace text in a string that matches a regular expression with new text. It behaves sort of like the. replace() string method.
The arguments passed to () are the regular expression, followed by the replacement text, followed by the string. Here’s an example:
>>>>>> string = “Everything is
>>> string = (“<. *>“, “ELEPHANTS”, string)
>>> string
‘Everything is ELEPHANTS. ‘
Perhaps that wasn’t quite what you expected to happen.
() uses the regular expression “<. *>” to find and replace everything between the first < and last >, which spans from the beginning of
Alternatively, you can use the non-greedy matching pattern *?, which works the same way as * except that it matches the shortest possible string of text:
>>> string = (“<. *? >“, “ELEPHANTS”, string)
“Everything is ELEPHANTS if it’s in ELEPHANTS. ”
This time, () finds two matches,
Check Your Understanding
Expand the block below to check your understanding.
Write a program that grabs the full HTML from the following URL:
Then use () to display the text following “Name:” and “Favorite Color:” (not including any leading spaces or trailing HTML tags that might appear on the same line).
You can expand the block below to see a solution.
First, import the urlopen function from the quest module:
from quest import urlopen
Then open the URL and use the () method of the HTTPResponse object returned by urlopen() to read the page’s HTML:
url = ”
html_page = urlopen(url)
html_text = ()(“utf-8”)
() returns a byte string, so you use () to decode the bytes using the UTF-8 encoding.
Now that you have the HTML source of the web page as a string assigned to the html_text variable, you can extract Dionysus’s name and favorite color from his profile. The structure of the HTML for Dionysus’s profile is the same as Aphrodite’s profile that you saw earlier.
You can get the name by finding the string “Name:” in the text and extracting everything that comes after the first occurence of the string and before the next HTML tag. That is, you need to extract everything after the colon (:) and before the first angle bracket (<). You can use the same technique to extract the favorite color. The following for loop extracts this text for both the name and favorite color: for string in ["Name: ", "Favorite Color:"]: string_start_idx = (string) text_start_idx = string_start_idx + len(string) next_html_tag_offset = html_text[text_start_idx:]("<") text_end_idx = text_start_idx + next_html_tag_offset raw_text = html_text[text_start_idx: text_end_idx] clean_text = (" \r\n\t") print(clean_text) It looks like there’s a lot going on in this forloop, but it’s just a little bit of arithmetic to calculate the right indices for extracting the desired text. Let’s break it down: You use () to find the starting index of the string, either "Name:" or "Favorite Color:", and then assign the index to string_start_idx. Since the text to extract starts just after the colon in "Name:" or "Favorite Color:", you get the index of the the character immediately after the colon by adding the length of the string to start_string_idx and assign the result to text_start_idx. You calculate the ending index of the text to extract by determining the index of the first angle bracket (<) relative to text_start_idx and assign this value to next_html_tag_offset. Then you add that value to text_start_idx and assign the result to text_end_idx. You extract the text by slicing html_text from text_start_idx to text_end_idx and assign this string to raw_text. You remove any whitespace from the beginning and end of raw_text using () and assign the result to clean_text. At the end of the loop, you use print() to display the extracted text. The final output looks like this: This solution is one of many that solves this problem, so if you got the same output with a different solution, then you did great! When you’re ready, you can move on to the next section. Use an HTML Parser for Web Scraping in Python Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with. Install Beautiful Soup To install Beautiful Soup, you can run the following in your terminal: $ python3 -m pip install beautifulsoup4 Run pip show to see the details of the package you just installed: $ python3 -m pip show beautifulsoup4 Name: beautifulsoup4 Version: 4. 9. 1 Summary: Screen-scraping library Home-page: Author: Leonard Richardson Author-email: License: MIT Location: c:\realpython\venv\lib\site-packages Requires: Required-by: In particular, notice that the latest version at the time of writing was 4. 1. Create a BeautifulSoup Object Type the following program into a new editor window: from bs4 import BeautifulSoup page = urlopen(url) html = ()("utf-8") soup = BeautifulSoup(html, "") This program does three things: Opens the URL using urlopen() from the quest module Reads the HTML from the page as a string and assigns it to the html variable Creates a BeautifulSoup object and assigns it to the soup variable The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "", tells the object which parser to use behind the scenes. "" represents Python’s built-in HTML parser. Use a BeautifulSoup Object Save and run the above program. When it’s finished running, you can use the soup variable in the interactive window to parse the content of html in various ways. For example, BeautifulSoup objects have a. get_text() method that can be used to extract all the text from the document and automatically remove any HTML tags. Type the following code into IDLE’s interactive window: >>>>>> print(t_text())
Profile: Dionysus
Name: Dionysus
Favorite animal: Leopard
Favorite Color: Wine
There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the string. replace() method if you need to.
Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the () string method is sometimes easier than working with regular expressions.
However, sometimes the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of HTML tags.
In this case, you can use find_all() to return a list of all instances of that particular tag:
>>>>>> nd_all(“img”)
[, ]
This returns a list of all tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they’re actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.
Let’s explore this a little by first unpacking the Tag objects from the list:
>>>>>> image1, image2 = nd_all(“img”)
Each Tag object has a property that returns a string containing the HTML tag type:
You can access the HTML attributes of the Tag object by putting their name between square brackets, just as if the attributes were keys in a dictionary.
For example, the tag has a single attribute, src, with the value “/static/”. Likewise, an HTML tag such as the link has two attributes, href and target.
To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:
>>>>>> image1[“src”]
‘/static/’
>>> image2[“src”]
Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the
>>>>>>
If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you’ll notice that the
Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.
You can also retrieve just the string between the title tags with the property of the Tag object:
‘Profile: Dionysus’
One of the more useful features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the tags that have a src attribute equal to the value /static/, then you can provide the following additional argument to. find_all():
>>>>>> nd_all(“img”, src=”/static/”)
[]
This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from the example. If you spend some time browsing various websites and viewing their page sources, then you’ll notice that many websites have extremely complicated HTML structures.
When scraping data from websites with Python, you’re often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.
Then, instead of relying on complicated regular expressions or using () to search through the document, you can directly access the particular tag you’re interested in and extract the data you need.
In some cases, you may find that Beautiful Soup doesn’t offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you’re comfortable using Beautiful Soup.
BeautifulSoup is great for scraping data from a website’s HTML, but it doesn’t provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then BeautifulSoup alone won’t get you very far.
Write a program that grabs the full HTML from the page at the URL Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag.
The final output should look like this:
You can expand the block below to see a solution:
First, import the urlopen function from the quest module and the BeautifulSoup class from the bs4 package:
Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:
base_url = ”
You can build a full URL by concatenating base_url with a relative URL.
Now open the /profiles page with urlopen() and use () to get the HTML source:
html_page = urlopen(base_url + “/profiles”)
With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:
soup = BeautifulSoup(html_text, “”)
nd_all(“a”) returns a list of all links in the HTML source. You can loop over this list to print out all the links on the webpage:
for link in nd_all(“a”):
link_url = base_url + link[“href”]
print(link_url)
The relative URL for each link can be accessed through the “href” subscript. Concatenate this value with base_url to create the full link_url.
Interact With HTML Forms
The urllib module you’ve been working with so far in this tutorial is well suited for requesting the contents of a web page. Sometimes, though, you need to interact with a web page to obtain the content you need. For example, you might need to submit a form or click a button to display hidden content.
The Python standard library doesn’t provide a built-in means for working with web pages interactively, but many third-party packages are available from PyPI. Among these, MechanicalSoup is a popular and relatively straightforward package to use.
In essence, MechanicalSoup installs what’s known as a headless browser, which is a web browser with no graphical user interface. This browser is controlled programmatically via a Python program.
Install MechanicalSoup
You can install MechanicalSoup with pip in your terminal:
$ python3 -m pip install MechanicalSoup
You can now view some details about the package with pip show:
$ python3 -m pip show mechanicalsoup
Name: MechanicalSoup
Version: 0. 12. 0
Summary: A Python library for automating interaction with websites
Home-page: Author: UNKNOWN
Author-email: UNKNOWN
Requires: requests, beautifulsoup4, six, lxml
In particular, notice that the latest version at the time of writing was 0. 0. You’ll need to close and restart your IDLE session for MechanicalSoup to load and be recognized after it’s been installed.
Create a Browser Object
Type the following into IDLE’s interactive window:
>>>>>> import mechanicalsoup
>>> browser = owser()
Browser objects represent the headless web browser. You can use them to request a page from the Internet by passing a URL to their () method:
>>> page = (url)
page is a Response object that stores the response from requesting the URL from the browser:
The number 200 represents the status code returned by the request. A status code of 200 means that the request was successful. An unsuccessful request might show a status code of 404 if the URL doesn’t exist or 500 if there’s a server error when making the request.
MechanicalSoup uses Beautiful Soup to parse the HTML from the request. page has a attribute that represents a BeautifulSoup object:
>>>>>> type()
You can view the HTML by inspecting the attribute:
Please log in to access Mount Olympus:
Notice this page has a
Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class nd_all(‘p’, class_=’outer-text’)
[
First outer paragraph.
,
Second outer paragraph.
]In the below example, we’ll look for any tag that has the class nd_all(class_=”outer-text”)
,
]We can also search for elements by nd_all(id=”first”)
[
]Using CSS SelectorsWe can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:p a — finds all a tags inside of a p p a — finds all a tags inside of a p tag inside of a body body — finds all body tags inside of an html — finds all p tags with a class of outer-text. p#first — finds all p tags with an id of — finds any p tags with a class of outer-text inside of a body can learn more about CSS selectors autifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like (“div p”)
,
]Note that the select method above returns a list of BeautifulSoup objects, just like find and wnloading weather dataWe now know enough to proceed with extracting information about the local weather from the National Weather Service website! The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this page. Specifically, let’s extract data about the extended we can see from the image, the page has information about the extended forecast for the next week, including time of day, temperature, and a brief description of the conditions. Exploring page structure with Chrome DevToolsThe first thing we’ll need to do is inspect the page using Chrome Devtools. If you’re using another browser, Firefox and Safari have can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser like what you see below. Make sure the Elements panel is highlighted:Chrome Developer ToolsThe elements panel will show you all the HTML tags on the page, and let you navigate through them. It’s a really handy feature! By right clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel:The extended forecast textWe can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast:The div that contains the extended forecast we click around on the console, and explore the div, we’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class to Start Scraping! We now know enough to download the page and start parsing it. In the below code, we will:Download the web page containing the a BeautifulSoup class to parse the the div with id seven-day-forecast, and assign to seven_dayInside seven_day, find each individual forecast item. Extract and print the first forecast = (“)
seven_day = (id=”seven-day-forecast”)
forecast_items = nd_all(class_=”tombstone-container”)
tonight = forecast_items[0]
print(ettify())
Tonight
Mostly Clear
Low: 49 °F
Extracting information from the pageAs we can see, inside the forecast item tonight is all the information we want. There are four pieces of information we can extract:The name of the forecast item — in this case, description of the conditions — this is stored in the title property of img. A short description of the conditions — in this case, Mostly temperature low — in this case, 49 ’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:period = (class_=”period-name”). get_text()
short_desc = (class_=”short-desc”). get_text()
temp = (class_=”temp”). get_text()
print(period)
print(short_desc)
print(temp)
Low: 49 °FNow, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:img = (“img”)
desc = img[‘title’]
print(desc)
Tonight: Mostly clear, with a low around 49. Extracting all the information from the pageNow that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at the below code, we will:Select all items with the class period-name inside an item with the class tombstone-container in a list comprehension to call the get_text method on each BeautifulSoup riod_tags = (“. tombstone-container “)
periods = [t_text() for pt in period_tags]
periods
[‘Tonight’,
‘Thursday’,
‘ThursdayNight’,
‘Friday’,
‘FridayNight’,
‘Saturday’,
‘SaturdayNight’,
‘Sunday’,
‘SundayNight’]As we can see above, our technique gets us each of the period names, in order. We can apply the same technique to get the other three fields:short_descs = [t_text() for sd in (“. tombstone-container “)]
temps = [t_text() for t in (“. tombstone-container “)]
descs = [d[“title”] for d in (“. tombstone-container img”)]print(short_descs)print(temps)print(descs)
[‘Mostly Clear’, ‘Sunny’, ‘Mostly Clear’, ‘Sunny’, ‘Slight ChanceRain’, ‘Rain Likely’, ‘Rain Likely’, ‘Rain Likely’, ‘Chance Rain’]
[‘Low: 49 °F’, ‘High: 63 °F’, ‘Low: 50 °F’, ‘High: 67 °F’, ‘Low: 57 °F’, ‘High: 64 °F’, ‘Low: 57 °F’, ‘High: 64 °F’, ‘Low: 55 °F’]
[‘Tonight: Mostly clear, with a low around 49. ‘, ‘Thursday: Sunny, with a high near 63. North wind 3 to 5 mph. ‘, ‘Thursday Night: Mostly clear, with a low around 50. Light and variable wind becoming east southeast 5 to 8 mph after midnight. ‘, ‘Friday: Sunny, with a high near 67. Southeast wind around 9 mph. ‘, ‘Friday Night: A 20 percent chance of rain after 11pm. Partly cloudy, with a low around 57. South southeast wind 13 to 15 mph, with gusts as high as 20 mph. New precipitation amounts of less than a tenth of an inch possible. ‘, ‘Saturday: Rain likely. Cloudy, with a high near 64. Chance of precipitation is 70%. New precipitation amounts between a quarter and half of an inch possible. ‘, ‘Saturday Night: Rain likely. Cloudy, with a low around 57. Chance of precipitation is 60%. ‘, ‘Sunday: Rain likely. ‘, ‘Sunday Night: A chance of rain. Mostly cloudy, with a low around 55. ‘]Combining our data into a Pandas DataframeWe can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. If you want to learn more about Pandas, check out our free to start course order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary key will become a column in the DataFrame, and each list will become the values in the column:import pandas as pd
weather = Frame({
“period”: periods,
“short_desc”: short_descs,
“temp”: temps,
“desc”:descs})
weather
desc
period
short_desc
temp
0
Tonight: Mostly clear, with a low around 49. W…
1
Thursday: Sunny, with a high near 63. North wi…
Thursday
Sunny
High: 63 °F
2
Thursday Night: Mostly clear, with a low aroun…
ThursdayNight
Low: 50 °F
3
Friday: Sunny, with a high near 67. Southeast …
Friday
High: 67 °F
4
Friday Night: A 20 percent chance of rain afte…
FridayNight
Slight ChanceRain
Low: 57 °F
5
Saturday: Rain likely. Cloudy, with a high ne…
Saturday
Rain Likely
High: 64 °F
6
Saturday Night: Rain likely. Cloudy, with a l…
SaturdayNight
7
Sunday: Rain likely. Cloudy, with a high near…
Sunday
8
Sunday Night: A chance of rain. Mostly cloudy…
SundayNight
Chance Rain
Low: 55 °F
We can now do some analysis on the data. For example, we can use a regular expression and the method to pull out the numeric temperature values:temp_nums = weather[“temp”](“(? Pd+)”, expand=False)
weather[“temp_num”] = (‘int’)
temp_nums
0 49
1 63
2 50
3 67
4 57
5 64
6 57
7 64
8 55
Name: temp_num, dtype: objectWe could then find the mean of all the high and low temperatures:weather[“temp_num”]()
58. 444444444444443We could also only select the rows that happen at night:is_night = weather[“temp”](“Low”)
weather[“is_night”] = is_night
is_night
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
Name: temp, dtype: boolweather[is_night]
Name: temp, dtype: bool
temp_num
49
True
50
57
55
Next Steps For This Web Scraping ProjectIf you’ve made it this far, congratulations! You should now have a good understanding of how to scrape web pages and extract data. Of course, there’s still a lot more to learn! If you want to go further, a good next step would be to pick a site and try some web scraping on your own. Some good examples of data to scrape are:News articlesSports scoresWeather forecastsStock pricesOnline retailer pricesYou may also want to keep scraping the National Weather Service, and see what other data you can extract from the page, or about your own ternatively, if you want to take your web scraping skills to the next level, you can check out our interactive course, which covers both the basics of web scraping and using Python to connect to APIs. With those two skills under your belt, you’ll be able to collect lots of unique and interesting datasets from sites all over the web! Learn to scrape the web with Python, right in your browser! Our interactive APIs and Web Scraping in Python skill path will help you learn the skills you need to unlock new worlds of data with Python. (No credit card required! )beginner, data mining, python, python tutorials, scraping, tutorial, Tutorials, web scraping
Python Web Scraping Tutorial: Step-By-Step [2021 Guide] – Blog
Getting started in web scraping is simple except when it isn’t which is why you are here. Python is one of the easiest ways to get started as it is an object-oriented language. Python’s classes and objects are significantly easier to use than in any other language. Additionally, many libraries exist that make building a tool for web scraping in Python an absolute breeze.
In this web scraping Python tutorial, we will outline everything needed to get started with a simple application. It will acquire text-based data from page sources, store it into a file and sort the output according to set parameters. Options for more advanced features when using Python for web scraping will be outlined at the very end with suggestions for implementation. By following the steps outlined below in this tutorial, you will be able to understand how to do web scraping.
What do we call web scraping? Web scraping is an automated process of gathering public data. Web scrapers automatically extract large amounts of public data from target websites in seconds.
This Python web scraping tutorial will work for all operating systems. There will be slight differences when installing either Python or development environments but not in anything else.
Building a web scraper: Python prepwork
Getting to the libraries
WebDrivers and browsers
Finding a cozy place for our Python web scraper
Importing and using libraries
Picking a URL
Defining objects and building lists
Extracting data with our Python web scraper
Exporting the data
More lists. More!
Web scraping with Python best practices
Conclusion
Throughout this entire web scraping tutorial, Python 3. 4+ version will be used. Specifically, we used 3. 8. 3 but any 3. 4+ version should work just fine.
For Windows installations, when installing Python make sure to check “PATH installation”. PATH installation adds executables to the default Windows Command Prompt executable search. Windows will then recognize commands like “pip” or “python” without requiring users to point it to the directory of the executable (e. g. C:/tools/python/…/). If you have already installed Python but did not mark the checkbox, just rerun the installation and select modify. On the second screen select “Add to environment variables”.
Web scraping with Python is easy due to the many useful libraries available
One of the Python advantages is a large selection of libraries for web scraping. These web scraping libraries are part of thousands of Python projects in existence – on PyPI alone, there are over 300, 000 projects today. Notably, there are several types of Python web scraping libraries from which you can choose:
RequestsBeautiful SouplxmlSelenium
Requests library
Web scraping starts with sending HTTP requests, such as POST or GET, to a website’s server, which returns a response containing the needed data. However, standard Python HTTP libraries are difficult to use and, for effectiveness, require bulky lines of code, further compounding an already problematic issue.
Unlike other HTTP libraries, the Requests library simplifies the process of making such requests by reducing the lines of code, in effect making the code easier to understand and debug without impacting its effectiveness. The library can be installed from within the terminal using the pip command:
pip install requests
Requests library provides easy methods for sending HTTP GET and POST requests. For example, the function to send an HTTP Get request is aptly named get():
import requests
response = (“)
print()
If there is a need for a form to be posted, it can be done easily using the post() method. The form data can sent as a dictionary as follows:
form_data = {‘key1’: ‘value1’, ‘key2’: ‘value2’}
response = (” “, data=form_data)
Requests library also makes it very easy to use proxies that require authentication.
proxies={”: ”}
response = (”, proxies=proxies)
But this library has a limitation in that it does not parse the extracted HTML data, i. e., it cannot convert the data into a more readable format for analysis. Also, it cannot be used to scrape websites that are written using purely JavaScript.
Beautiful Soup
Beautiful Soup is a Python library that works with a parser to extract data from HTML and can turn even invalid markup into a parse tree. However, this library is only designed for parsing and cannot request data from web servers in the form of HTML documents/files. For this reason, it is mostly used alongside the Python Requests Library. Note that Beautiful Soup makes it easy to query and navigate the HTML, but still requires a parser. The following example demonstrates the use of the module, which is part of the Python Standard Library.
#Part 1 – Get the HTML using Requests
url=”
response = (url)
#Part 2 – Find the element
from bs4 import BeautifulSoup
soup = BeautifulSoup(, ”)
This will print the title element as follows:
Oxylabs Blog
Due to its simple ways of navigating, searching and modifying the parse tree, Beautiful Soup is ideal even for beginners and usually saves developers hours of work. For example, to print all the blog titles from this page, the findAll() method can be used. On this page, all the blog titles are in h2 elements with class attribute set to blog-card__content-title. This information can be supplied to the findAll method as follows:
blog_titles = ndAll(‘h2’, attrs={“class”:”blog-card__content-title”})
for title in blog_titles:
# Output:
# Prints all blog tiles on the page
BeautifulSoup also makes it easy to work with CSS selectors. If a developer knows a CSS selector, there is no need to learn find() or find_all() methods. The following is the same example, but uses CSS selectors:
blog_titles = (”)
While broken-HTML parsing is one of the main features of this library, it also offers numerous functions, including the fact that it can detect page encoding further increasing the accuracy of the data extracted from the HTML file.
What is more, it can be easily configured, with just a few lines of code, to extract any custom publicly available data or to identify specific data types. Our Beautiful Soup tutorial contains more on this and other configurations, as well as how this library works.
lxml
lxml is a parsing library. It is a fast, powerful, and easy-to-use library that works with both HTML and XML files. Additionally, lxml is ideal when extracting data from large datasets. However, unlike Beautiful Soup, this library is impacted by poorly designed HTML, making its parsing capabilities impeded.
The lxml library can be installed from the terminal using the pip command:
pip install lxml
This library contains a module html to work with HTML. However, the lxml library needs the HTML string first. This HTML string can be retrieved using the Requests library as discussed in the previous section. Once the HTML is available, the tree can be built using the fromstring method as follows:
# After response = ()
from lxml import html
tree = omstring()
This tree object can now be queried using XPath. Continuing the example discussed in the previous section, to get the title of the blogs, the XPath would be as follows:
//h2[@class=”blog-card__content-title”]/text()
This XPath can be given to the () function. This will return all the elements matching this XPath. Notice the text() function in the XPath. This will extract the text within the h2 elements.
blog_titles = (‘//h2[@class=”blog-card__content-title”]/text()’)
print(title)
Suppose you are looking to learn how to use this library and integrate it into your web scraping efforts or even gain more knowledge on top of your existing expertise. In that case, our detailed lxml tutorial is an excellent place to start.
Selenium
As stated, some websites are written using JavaScript, a language that allows developers to populate fields and menus dynamically. This creates a problem for Python libraries that can only extract data from static web pages. In fact, as stated, the Requests library is not an option when it comes to JavaScript. This is where Selenium web scraping comes in and thrives.
This Python web library is an open-source browser automation tool (web driver) that allows you to automate processes such as logging into a social media platform. Selenium is widely used for the execution of test cases or test scripts on web applications. Its strength during web scraping derives from its ability to initiate rendering web pages, just like any browser, by running JavaScript – standard web crawlers cannot run this programming language. Yet, it is now extensively used by developers.
Selenium requires three components:
Web Browser – Supported browsers are Chrome, Edge, Firefox and SafariDriver for the browser – See this page for links to the driversThe selenium package
The selenium package can be installed from the terminal:
pip install selenium
After installation, the appropriate class for the browser can be imported. Once imported, the object of the class will have to be created. Note that this will require the path of the driver executable. Example for the Chrome browser as follows:
from selenium. webdriver import Chrome
driver = Chrome(executable_path=’/path/to/driver’)
Now any page can be loaded in the browser using the get() method.
(”)
Selenium allows use of CSS selectors and XPath to extract elements. The following example prints all the blog titles using CSS selectors:
blog_titles = t_elements_by_css_selector(‘ ‘)
for title in blog_tiles:
() # closing the browser
Basically, by running JavaScript, Selenium deals with any content being displayed dynamically and subsequently makes the webpage’s content available for parsing by built-in methods or even Beautiful Soup. Moreover, it can mimic human behavior.
The only downside to using Selenium in web scraping is that it slows the process because it must first execute the JavaScript code for each page before making it available for parsing. As a result, it is unideal for large-scale data extraction. But if you wish to extract data at a lower-scale or the lack of speed is not a drawback, Selenium is a great choice.
Web scraping Python libraries compared
RequestsBeautiful SouplxmlSeleniumPurposeSimplify making HTTP requestsParsingParsingSimplify making HTTP requestsEase-of-useHighHighMediumMediumSpeedFastFastVery fastSlowLearning CurveVery easy (beginner-friendly)Very easy (beginner-friendly)EasyEasyDocumentationExcellentExcellentGoodGoodJavaScript SupportNoneNoneNoneYesCPU and Memory UsageLowLowLowHighSize of Web Scraping Project SupportedLarge and smallLarge and smallLarge and smallSmall
For this Python web scraping tutorial, we’ll be using three important libraries – BeautifulSoup v4, Pandas, and Selenium. Further steps in this guide assume a successful installation of these libraries. If you receive a “NameError: name * is not defined” it is likely that one of these installations has failed.
Every web scraper uses a browser as it needs to connect to the destination URL. For testing purposes we highly recommend using a regular browser (or not a headless one), especially for newcomers. Seeing how written code interacts with the application allows simple troubleshooting and debugging, and grants a better understanding of the entire process.
Headless browsers can be used later on as they are more efficient for complex tasks. Throughout this web scraping tutorial we will be using the Chrome web browser although the entire process is almost identical with Firefox.
To get started, use your preferred search engine to find the “webdriver for Chrome” (or Firefox). Take note of your browser’s current version. Download the webdriver that matches your browser’s version.
If applicable, select the requisite package, download and unzip it. Copy the driver’s executable file to any easily accessible directory. Whether everything was done correctly, we will only be able to find out later on.
One final step needs to be taken before we can get to the programming part of this web scraping tutorial: using a good coding environment. There are many options, from a simple text editor, with which simply creating a * file and writing the code down directly is enough, to a fully-featured IDE (Integrated Development Environment).
If you already have Visual Studio Code installed, picking this IDE would be the simplest option. Otherwise, I’d highly recommend PyCharm for any newcomer as it has very little barrier to entry and an intuitive UI. We will assume that PyCharm is used for the rest of the web scraping tutorial.
In PyCharm, right click on the project area and “New -> Python File”. Give it a nice name!
Time to put all those pips we installed previously to use:
import pandas as pd
from selenium import webdriver
PyCharm might display these imports in grey as it automatically marks unused libraries. Don’t accept its suggestion to remove unused libs (at least yet).
We should begin by defining our browser. Depending on the webdriver we picked back in “WebDriver and browsers” we should type in:
driver = (executable_path=’c:\path\to\windows\webdriver\’)
OR
driver = refox(executable_path=’/nix/path/to/webdriver/executable’)
Python web scraping requires looking into the source of websites
Before performing our first test run, choose a URL. As this web scraping tutorial is intended to create an elementary application, we highly recommended picking a simple target URL:
Avoid data hidden in Javascript elements. These sometimes need to be triggered by performing specific actions in order to display the required data. Scraping data from Javascript elements requires more sophisticated use of Python and its image scraping. Images can be downloaded directly with conducting any scraping activities ensure that you are scraping public data, and are in no way breaching third-party rights. Also, don’t forget to check the file for guidance.
Select the landing page you want to visit and input the URL into the (‘URL’) parameter. Selenium requires that the connection protocol is provided. As such, it is always necessary to attach “” or “” to the URL.
Try doing a test run by clicking the green arrow at the bottom left or by right clicking the coding environment and selecting ‘Run’.
Follow the red pointer
If you receive an error message stating that a file is missing then turn double check if the path provided in the driver “webdriver. *” matches the location of the webdriver executable. If you receive a message that there is a version mismatch redownload the correct webdriver executable.
Python allows coders to design objects without assigning an exact type. An object can be created by simply typing its title and assigning a value.
# Object is “results”, brackets make the object an empty list.
# We will be storing our data here.
results = []
Lists in Python are ordered, mutable and allow duplicate members. Other collections, such as sets or dictionaries, can be used but lists are the easiest to use. Time to make more objects!
# Add the page source to the variable `content`.
content = ge_source
# Load the contents of the page, its source, into BeautifulSoup
# class, which analyzes the HTML as a nested data structure and allows to select
# its elements by using various selectors.
soup = BeautifulSoup(content)
Before we go on with, let’s recap on how our code should look so far:
driver = (executable_path=’/nix/path/to/webdriver/executable’)
Try rerunning the application again. There should be no errors displayed. If any arise, a few possible troubleshooting options were outlined in earlier chapters.
We have finally arrived at the fun and difficult part – extracting data out of the HTML file. Since in almost all cases we are taking small sections out of many different parts of the page and we want to store it into a list, we should process every smaller section and then add it to the list:
# Loop over all elements returned by the `findAll` call. It has the filter `attrs` given
# to it in order to limit the data returned to those elements with a given class only.
for element in ndAll(attrs={‘class’: ‘list-item’}):…
“ndAll” accepts a wide array of arguments. For the purposes of this tutorial we only use “attrs” (attributes). It allows us to narrow down the search by setting up a statement “if attribute is equal to X is true then…”. Classes are easy to find and use therefore we shall use those.
Let’s visit the chosen URL in a real browser before continuing. Open the page source by using CTRL+U (Chrome) or right click and select “View Page Source”. Find the “closest” class where the data is nested. Another option is to press F12 to open DevTools to select Element Picker. For example, it could be nested as:
This is a Title
Our attribute, “class”, would then be “title”. If you picked a simple target, in most cases data will be nested in a similar way to the example above. Complex targets might require more effort to get the data out. Let’s get back to coding and add the class we found in the source:
# Change ‘list-item’ to ‘title’.
for element in ndAll(attrs={‘class’: ‘title’}):…
Our loop will now go through all objects with the class “title” in the page source. We will process each of them:
name = (‘a’)
Let’s take a look at how our loop goes through the HTML:
Our first statement (in the loop itself) finds all elements that match tags, whose “class” attribute contains “title”. We then execute another search within that class. Our next search finds all the tags in the document ( is included while partial matches like are not). Finally, the object is assigned to the variable “name”.
We could then assign the object name to our previously created list array “results” but doing this would bring the entire tag with the text inside it into one element. In most cases, we would only need the text itself without any additional tags.
# Add the object of “name” to the list “results”.
# `
()
Our loop will go through the entire page source, find all the occurrences of the classes listed above, then append the nested data to our list:
for element in ndAll(attrs={‘class’: ‘title’}):
Note that the two statements after the loop are indented. Loops require indentation to denote nesting. Any consistent indentation will be considered legal. Loops without indentation will output an “IndentationError” with the offending statement pointed out with the “arrow”.
Python web scraping requires constant double-checking of the code
Even if no syntax or runtime errors appear when running our program, there still might be semantic errors. You should check whether we actually get the data assigned to the right object and move to the array correctly.
One of the simplest ways to check if the data you acquired during the previous steps is being collected correctly is to use “print”. Since arrays have many different values, a simple loop is often used to separate each entry to a separate line in the output:
for x in results:
print(x)
Both “print” and “for” should be self-explanatory at this point. We are only initiating this loop for quick testing and debugging purposes. It is completely viable to print the results directly:
print(results)
So far our code should look like this:
for a in ndAll(attrs={‘class’: ‘class’}):
if name not in results:
Running our program now should display no errors and display acquired data in the debugger window. While “print” is great for testing purposes, it isn’t all that great for parsing and analyzing data.
You might have noticed that “import pandas” is still greyed out so far. We will finally get to put the library to good use. I recommend removing the “print” loop for now as we will be doing something similar but moving our data to a csv file.
df = Frame({‘Names’: results})
_csv(”, index=False, encoding=’utf-8′)
Our two new statements rely on the pandas library. Our first statement creates a variable “df” and turns its object into a two-dimensional data table. “Names” is the name of our column while “results” is our list to be printed out. Note that pandas can create multiple columns, we just don’t have enough lists to utilize those parameters (yet).
Our second statement moves the data of variable “df” to a specific file type (in this case “csv”). Our first parameter assigns a name to our soon-to-be file and an extension. Adding an extension is necessary as “pandas” will otherwise output a file without one and it will have to be changed manually. “index” can be used to assign specific starting numbers to columns. “encoding” is used to save data in a specific format. UTF-8 will be enough in almost all cases.
No imports should now be greyed out and running our application should output a “” into our project directory. Note that a “Guessed At Parser” warning remains. We could remove it by installing a third party parser but for the purposes of this Python web scraping tutorial the default HTML option will do just fine.
Python web scraping often requires many data points
Many web scraping operations will need to acquire several sets of data. For example, extracting just the titles of items listed on an e-commerce website will rarely be useful. In order to gather meaningful information and to draw conclusions from it at least two data points are needed.
For the purposes of this tutorial, we will try something slightly different. Since acquiring data from the same class would just mean appending to an additional list, we should attempt to extract data from a different class but, at the same time, maintain the structure of our table.
Obviously, we will need another list to store our data in.
other_results = []
for b in ndAll(attrs={‘class’: ‘otherclass’}):
# Assume that data is nested in ‘span’.
name2 = (‘span’)
Since we will be extracting an additional data point from a different part of the HTML, we will need an additional loop. If needed we can also add another “if” conditional to control for duplicate entries:
Finally, we need to change how our data table is formed:
df = Frame({‘Names’: results, ‘Categories’: other_results})
So far the newest iteration of our code should look something like this:
name = (‘a’)
If you are lucky, running this code will output no error. In some cases “pandas” will output an “ValueError: arrays must all be the same length” message. Simply put, the length of the lists “results” and “other_results” is unequal, therefore pandas cannot create a two-dimensional table.
There are dozens of ways to resolve that error message. From padding the shortest list with “empty” values, to creating dictionaries, to creating two series and listing them out. We shall do the third option:
series1 = (results, name = ‘Names’)
series2 = (other_results, name = ‘Categories’)
df = Frame({‘Names’: series1, ‘Categories’: series2})
Note that data will not be matched as the lists are of uneven length but creating two series is the easiest fix if two data points are needed. Our final code should look something like this:
Running it should create a csv file named “names” with two columns of data.
Our first web scraper should now be fully functional. Of course it is so basic and simplistic that performing any serious data acquisition would require significant upgrades. Before moving on to greener pastures, I highly recommend experimenting with some additional features:
Create matched data extraction by creating a loop that would make lists of an even length.
Scrape several URLs in one go. There are many ways to implement such a feature. One of the simplest options is to simply repeat the code above and change URLs each time. That would be quite boring. Build a loop and an array of URLs to visit.
Another option is to create several arrays to store different sets of data and output it into one file with different rows. Scraping several different types of information at once is an important part of e-commerce data acquisition.
Once a satisfactory web scraper is running, you no longer need to watch the browser perform its actions. Get headless versions of either Chrome or Firefox browsers and use those to reduce load times.
Create a scraping pattern. Think of how a regular user would browse the internet and try to automate their actions. New libraries will definitely be needed. Use “import time” and “from random import randint” to create wait times between pages. Add “scrollto()” or use specific key inputs to move around the browser. It’s nearly impossible to list all of the possible options when it comes to creating a scraping pattern.
Create a monitoring process. Data on certain websites might be time (or even user) sensitive. Try creating a long-lasting loop that rechecks certain URLs and scrapes data at set intervals. Ensure that your acquired data is always fresh.
Make use of the Python Requests library. Requests is a powerful asset in any web scraping toolkit as it allows to optimize HTTP methods sent to servers.
Finally, integrate proxies into your web scraper. Using location specific request sources allows you to acquire data that might otherwise be inaccessible.
If you enjoy video content more, watch our embedded, simplified version of the web scraping tutorial!
You can also watch our video version of the tutorial!
From here onwards, you are on your own. Building web scrapers in Python, acquiring data and drawing conclusions from large amounts of information is inherently an interesting and complicated process.
If you want to find out more about how proxies or advanced data acquisition tools work, or about specific web scraping use cases, such as web scraping job postings or building a yellow page scraper, check out our blog. We have enough articles for everyone: a more detailed guide on how to avoid blocks when scraping, is web scraping legal, an in-depth walkthrough on what is a proxy and many more!
Adomas Sulcas is a Content Manager at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.
All information on Oxylabs Blog is provided on an “as is” basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website’s terms of service or receive a scraping license.
Frequently Asked Questions about python web scraping example
Is web scraping with Python legal?
So is it legal or illegal? Web scraping and crawling aren’t illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. … Big companies use web scrapers for their own gain but also don’t want others to use bots against them.
What is web scraping in Python with example?
Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically.
How do I use Python to scrape a website?
To extract data using web scraping with python, you need to follow these basic steps:Find the URL that you want to scrape.Inspecting the Page.Find the data you want to extract.Write the code.Run the code and extract the data.Store the data in the required format.Sep 24, 2021