Web Crawler Python Selenium
Build a scalable web crawler with Selenium and Python
Implementation within the Google Cloud Platform by using Docker, Kubernetes Engine and Cloud Datastore.

Fig. 1 — Image from Pixabay (Pixabay License)

Disclaimer: Since scraping of Medium's services is prohibited by its terms of use, I would like to point out that within the project we immediately processed the underlying data with NLP and did not store the raw texts. The approach illustrated in this article is therefore for demonstration purposes only and can be used for other websites that allow web crawling.

This article is part of a larger project. If you are also interested in performing Natural Language Processing on the results to extract technology names by using PySpark and Kubernetes, or in building highly scalable dashboards in Python, you will find corresponding links at the end of the article.

Introduction
Project Idea and Approach
Source Inspection and Packages
Implementation Steps
Results

Life as a Data Scientist can be tough. It is not only the acquisition and quality of data and its interpretability that pose challenges. The rapid development of technologies, as well as constantly rising expectations from business (keyword: rocket science), also make the work more difficult. However, in my experience, the acquisition and application of new technologies in particular is a source of enthusiasm for most data scientists. For this reason, I built a scalable web crawler with common technologies to improve my skills. All files and code snippets that are referenced in this article can be found in my GitHub repository.

Towards Data Science (TWDS) is one of the best known and most instructive places to go for data science. It is a publication on which a large number of authors have published various articles. Recurrently used technologies are referenced and their use is often presented in case studies. Therefore I decided to build a web crawler that extracts the content of TWDS and stores it inside the NoSQL database "Google Datastore".
To make the web crawler scalable, I used Docker for containerizing my application and Kubernetes for the deployment.

Fig. 2 — Technical Overview of the scalable infrastructure

The approach was to develop the web crawler in a Jupyter Notebook on my local machine and to constantly professionalize and enlarge the project (see Fig. 2). For instance, I built a Python application with a dedicated crawler class and all necessary methods based on the Jupyter Notebook scripts. But let us have a more detailed look at the implementation steps.

3.1 Source inspection

To develop a properly operating web crawler, it is important to familiarize yourself in advance with the site structure and the available content.

Fig. 3 — Connection of the relevant entities

TWDS is a classic publication with many authors and a lot of articles. Thanks to an archive page it was easy to understand the page structure in detail (see Fig. 3). Fortunately, the authors were not only listed there but also provided with links that led to overview pages for these authors.

Fig. 4 — Page source code for the authors list on the TWDS archive

The HTML class used for these links was consistent, so the links could easily be identified (see Fig. 4). On the overview pages of the authors, I figured out that only the author's articles published on TWDS were listed at first. Other articles published on Medium by the author were not displayed. It was therefore not necessary to check whether a specific article belonged to the TWDS publication. Unfortunately, the HTML class for these article links was empty and the links could not be identified that way. However, the links contained the complete URL and thus the word "towards", so their identification was just as unambiguous. Another challenge occurred when examining the page: not all of the author's articles were displayed directly; when the website was scrolled down, further content was dynamically reloaded using Javascript.
To ensure completeness, this had to be taken into account in the development of the web crawler.

Fig. 5 — Example of the HTML source code of a TWDS article

Finally, I had to examine the structure of the individual articles for similarities and patterns in order to extract the relevant data fields. The required properties were author, URL, title, text, reading time, publishing date, tags, claps and the number of responses. As can be seen in Figure 5, the HTML source code poses some challenges. For example, the class names are seemingly dynamically generated and have only minor matches across articles. But there are also rays of hope, e.g. reading time, title, URL, and publishing date are standardized in the page header. The remaining content was reasonably easy to access.

3.2 Package Selection

At first, during development in Jupyter Notebooks, I was looking for Python packages I could use to fulfill all requirements. I quickly realized that with Scrapy, one of the most commonly used packages for web scraping, handling dynamically reloaded content would be difficult. After focusing on this requirement, I became aware of Selenium. Selenium is a framework for automated software testing of web applications and can interact with browsers, e.g. to scroll down pages to load the dynamic Javascript content and receive the full HTML source code. To work with the extracted HTML source code, I found the Python package BeautifulSoup4, which provides various methods to systematically search the HTML tree structure for relevant content. With these packages selected, I could fulfill all the requirements to develop a web crawler.

4.1 Development of a Python-based web crawler

During the development, I worked along the page structure shown in Figure 3. So I started with the extraction of the author list. I defined the URL to be crawled and used it to start the Selenium Webdriver. In the following, I extracted all required parts of the code to run the Selenium Webdriver.
# Import
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Define browser options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Hides the browser window

# Reference the local Chromedriver instance
chrome_path = r'/usr/local/bin/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)

# Run the Webdriver, save the page source and quit the browser
driver.get("https://...")
page_source = driver.page_source
driver.quit()
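Because further articles are reloaded via Javascript when the page is scrolled down (see the source inspection above), the crawler also needs a scroll loop before reading the page source. Here is a minimal sketch of that idea; the helper name scroll_to_bottom is mine and not taken from the project code:

```python
import time

def scroll_to_bottom(driver, pause: float = 1.0, max_rounds: int = 50) -> int:
    """Scroll down until the page height stops growing, so all dynamically
    reloaded content is present in driver.page_source. Returns final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the Javascript time to load more articles
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, we reached the real bottom
        last_height = new_height
    return last_height
```

The pause length is a trade-off: too short and the loop stops before the new content arrives, too long and the crawl slows down.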
Web Scraping using Selenium and Python – ScrapingBee
Updated:
08 July, 2021
9 min read
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.
In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems.
Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).
Now it is still used for testing, but also as a general browser automation platform. And of course, it is used for web scraping!
Selenium is useful when you have to perform an action on a website such as:
Clicking on buttons
Filling forms
Scrolling
Taking a screenshot
It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application, and you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.
Installation
We will use Chrome in our example, so make sure you have it installed on your local machine:
Chrome download page
Chrome driver binary
selenium package
To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
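A minimal install, assuming the standard selenium package from PyPI:

```shell
pip install selenium
```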
Quickstart
Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
from selenium import webdriver

DRIVER_PATH = '/path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://...")
This will launch Chrome in headful mode (like regular Chrome, which is controlled by your Python code).
You should see a message stating that the browser is controlled by automated software.
To run Chrome in headless mode (without any graphical user interface), which is useful when running it on a server, see the following example:
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://...")
print(driver.page_source)
driver.quit()
The driver.page_source property will return the full page HTML code.
Here are two other interesting WebDriver properties:
driver.title gets the page's title
driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
Locating Elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
Tag name
Class name
IDs
XPath
CSS selectors
We recently published an article explaining XPath. Don’t hesitate to take a look if you aren’t familiar with XPath.
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need.
A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C or on macOS Cmd + Shift + C instead of having to right click + inspect each time:
find_element
There are many ways to locate an element in Selenium.
Let’s say that we want to locate the h1 tag in this HTML:
<h1 class="someclass" id="greatID">Super title</h1>
h1 = driver.find_element_by_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')
All these methods also have find_elements (note the plural) to return a list of elements.
For example, to get all anchors on a page, use the following:
all_links = driver.find_elements_by_tag_name('a')
Some elements aren’t easily accessible with an ID or a simple class, and that’s when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements, here are the most useful:
Accessing the text of the element with the element.text property
Clicking on the element with element.click()
Accessing an attribute with element.get_attribute('class')
Sending text to an input with element.send_keys('mypassword')
There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.
It can be interesting to avoid honeypots (like filling hidden inputs).
Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:

<input type="hidden" name="custom_field" value="">
This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random value, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.
That’s a classic honeypot.
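Putting is_displayed() to work, here is a small sketch of a honeypot-safe form filler; the helper name fill_visible_inputs is mine, not part of the Selenium API:

```python
def fill_visible_inputs(input_elements, value):
    """Send `value` only to inputs that are rendered for a real user,
    skipping hidden honeypot fields such as type=hidden inputs."""
    filled = 0
    for field in input_elements:
        if field.is_displayed():  # hidden honeypot inputs return False here
            field.send_keys(value)
            filled += 1
    return filled

# Usage sketch with a live driver:
# inputs = driver.find_elements_by_tag_name('input')
# fill_visible_inputs(inputs, 'some text')
```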
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
Go to the login page using driver.get()
Select the username input using driver.find_element_by_* and then send_keys() to send text to the input
Follow the same process with the password input
Click on the login button using click()
Should be easy right? Let’s see the code:
driver.get("https://news.ycombinator.com/login")
login = driver.find_element_by_xpath("//input").send_keys(USERNAME)
password = driver.find_element_by_xpath("//input[@type='password']").send_keys(PASSWORD)
submit = driver.find_element_by_xpath("//input[@value='login']").click()
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
Check for an error message (like “Wrong password”)
Check for one element on the page that is only displayed once logged in.
So, we’re going to check for the logout button. The logout button has the ID “logout” (easy)!
We can’t just check if the element is None because all of the find_element_by_* raise an exception if the element is not found in the DOM.
So we have to use a try/except block and catch the NoSuchElementException exception:
# don't forget: from selenium.common.exceptions import NoSuchElementException
try:
    logout_button = driver.find_element_by_id("logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')
We could easily take a screenshot using:
driver.save_screenshot('screenshot.png')
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly.
Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
In our Hacker News case it’s simple and we don’t have to worry about these issues.
If you need to make screenshots at scale, feel free to try our new Screenshot API here.
Waiting for an element to be present
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
Use time.sleep(ARBITRARY_TIME) before taking the screenshot.
Use a WebDriverWait object.
If you use time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting too long or not long enough.
Also the website can load slowly on your local wifi internet connection, but will be 10 times faster on your cloud server.
With the WebDriverWait method you will wait the exact amount of time necessary for your element/data to be loaded.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    driver.quit()
This will wait five seconds for an element located by the ID “mySuperId” to be loaded.
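The waiting logic can be illustrated with a plain-Python analogue of WebDriverWait's until() — a sketch of the polling idea, not Selenium's actual implementation; the name wait_until is mine:

```python
import time

def wait_until(condition, timeout: float = 5.0, poll: float = 0.1):
    """Call `condition` repeatedly until it returns a truthy value or
    `timeout` seconds have passed, like WebDriverWait(driver, timeout).until()."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll)
```

This is why WebDriverWait never waits longer than needed: it returns as soon as the expected condition first succeeds.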
There are many other interesting expected conditions like:
element_to_be_clickable
text_to_be_present_in_element
You can find more information about this in the Selenium documentation.
Executing Javascript
Sometimes, you may need to execute some Javascript on the page. For example, let’s say you want to take a screenshot of some information, but you first need to scroll a bit to see it.
You can easily do this with Selenium:
javaScript = "window.scrollBy(0, 1000);"
driver.execute_script(javaScript)
Using a proxy with Selenium Wire
Unfortunately, Selenium's proxy handling is quite basic. For example, it can't handle proxies with authentication out of the box.
To solve this issue, you need to use Selenium Wire.
This package extends Selenium’s bindings and gives you access to all the underlying requests made by the browser.
If you need to use Selenium with a proxy with authentication this is the package you need.
pip install selenium-wire
This code snippet shows you how to quickly use your headless browser behind a proxy.
# Install the Python selenium-wire library:
# pip install selenium-wire
from seleniumwire import webdriver

proxy_username = "USER_NAME"
proxy_password = "PASSWORD"
proxy_url = "..."  # proxy host
proxy_port = 8886

options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}",
        "verify_ssl": False,
    },
}

URL = "https://..."

driver = webdriver.Chrome(
    executable_path="YOUR-CHROME-EXECUTABLE-PATH",
    seleniumwire_options=options,
)
driver.get(URL)
Blocking images and JavaScript
With Selenium, by using the correct Chrome options, you can block some requests from being made.
This can be useful if you need to speed up your scrapers or reduce your bandwidth usage.
To do this, you need to launch Chrome with the below options:
chrome_options = webdriver.ChromeOptions()

### This blocks images and javascript requests
chrome_prefs = {
    "profile.default_content_setting_values": {
        "images": 2,
        "javascript": 2,
    }
}
chrome_options.experimental_options["prefs"] = chrome_prefs
###

driver = webdriver.Chrome(
    executable_path="YOUR-CHROME-EXECUTABLE-PATH",
    chrome_options=chrome_options,
)
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don’t hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn't have an API, it may be a good idea to automate them with Selenium, just don't forget this xkcd:
Intro to automation and web Crawling with Selenium – Medium
Learn how to use Selenium and Python to scrape and interact with any website.

In this in-depth tutorial series, you will learn how to use Selenium + Python to crawl and interact with almost any website. More specifically, you'll learn how to:

Make requests and select elements using CSS selectors and XPath — Tutorial Part 1
Log in to any web platform — Tutorial Part 2
Pro tips and crawling in practice — Tutorial Part 3

Selenium is a Web Browser Automation Tool originally designed to automate web applications for testing purposes. It is now used for many other applications, such as automating web-based admin tasks and interacting with platforms that do not provide an API, as well as for Web Crawling.

There are many reasons to choose Selenium when crawling. Here are some of them:

Supports many languages: Python, Java, C#, PHP, Ruby…
Supports Javascript: so you can access more information on the page, and simulate behaviours that are close to a human's
Can be integrated with Maven, Jenkins & Docker, so it is easy to productionise your scripts

On the other side, Selenium has some drawbacks compared to regular (non-js) crawlers like Scrapy, requests or urllib in Python. More specifically, it needs more resources, is slower, and is harder to scale. It is therefore advisable to use Selenium only when speed is not an issue, and to reserve it for the most complex sites.

Important note: Scraping is against some websites' terms of service. Please read a website's terms of service before crawling it.

In this tutorial, we will use Python 3.x. You can also use Python 2.7, but some parts of the code may require slight changes.

Install dependencies

First you will need to create your own virtual environment and install the Selenium Python module. If you need to install virtualenv first, please follow the installation instructions.

virtualenv selenium_example
source selenium_example/bin/activate
pip install selenium

Install Chrome Driver

Second, you need to install the Google Chrome Driver.
Click here to download the latest driver.

NB: Selenium also supports Firefox and Safari, but Chrome is the most popular among developers.

Create a script and start importing the necessary packages

Let's now load our essential dependencies for this tutorial!

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

The first line imports the Web Driver, and the second imports the Chrome Options.

Selenium offers many options, such as:

The window size
Browsing in incognito mode
Using proxies

In this tutorial we will browse in incognito mode and set the window size to 1920×1080. You'll learn how to use proxies in the last part.

chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--window-size=1920,1080")

Create your instance:

driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=
Frequently Asked Questions about web crawler python selenium
Is Selenium a web crawler?
Selenium is a Web Browser Automation Tool originally designed to automate web applications for testing purposes. It is now used for many other applications such as automating web-based admin tasks, interact with platforms which do not provide Api, as well as for Web Crawling.Jan 13, 2019
How do I use Python web scraping in Selenium?
Implementation of image web scraping using Selenium Python:
Step 1: Import libraries.
Step 2: Install driver.
Step 3: Specify search URL.
Step 4: Scroll to the end of the page.
Step 5: Locate the images to be scraped from the page.
Step 6: Extract the corresponding link of each image.
Aug 30, 2020
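Steps 3, 5 and 6 of that recipe can be sketched with the (legacy, pre-Selenium-4) API used throughout this page; the helper name collect_image_links is mine, not from the quoted answer:

```python
def collect_image_links(driver, search_url):
    """Load the search page and return the src attribute of every <img>
    element found on it."""
    driver.get(search_url)  # step 3: open the search URL
    images = driver.find_elements_by_tag_name("img")  # step 5: locate images
    return [img.get_attribute("src") for img in images]  # step 6: extract links
```

Step 4 (scrolling to the end of the page) would be handled beforehand with driver.execute_script, as shown earlier in this page.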
Can we use Selenium for Web scraping?
Selenium is a Python library and tool used for automating web browsers to do a number of tasks. One of such is web-scraping to extract useful data and information that may be otherwise unavailable.