Selenium Web Crawler
Build a scalable web crawler with Selenium and Python
Implementation within the Google Cloud Platform using Docker, Kubernetes Engine and Cloud Datastore

Fig. 1 — Image from Pixabay (Pixabay License)

Disclaimer: Since scraping of the platform's services is prohibited by the terms of use, I would like to point out that we immediately processed the underlying data within the project with NLP and no storing of the pure texts took place. The approach illustrated in this article is therefore for demonstration purposes only and can be used for other websites that allow web scraping.

This article is part of a larger project. If you are also interested in performing Natural Language Processing on the results to extract technology names by using PySpark and Kubernetes, or in building highly scalable dashboards in Python, you will find corresponding links at the end of the article.

Introduction
Project Idea and Approach
Source Inspection and Packages
Implementation Steps
Results

Life as a Data Scientist can be tough. It is not only the acquisition and quality of data and its interpretability that pose challenges. The rapid development of technologies, as well as constantly rising expectations from business (keyword: rocket science), also make the work more difficult. However, in my experience, the acquisition and application of new technologies, in particular, is a source of enthusiasm for most data scientists. For this reason, I built a scalable web crawler with common technologies to improve my skills. All files and code snippets that are referenced in this article can be found in my GitHub repository.

Towards Data Science (TWDS) is one of the best known and most instructive places to go for data science. It is a publication on which a large number of authors have published various articles. Recurrently used technologies are referenced and their use is often presented in case studies. Therefore I decided to build a web crawler that extracts the content of TWDS and stores it inside the NoSQL database "Google Datastore". To make the web crawler scalable, I used Docker to containerize my application and Kubernetes for the orchestration.

Fig. 2 — Technical overview of the scalable infrastructure

The approach was to develop the web crawler in a Jupyter Notebook on my local machine and to professionalize and scale up the project step by step (see Fig. 2). For instance, I built a Python application with a dedicated crawler class and all necessary methods based on the Jupyter Notebook scripts. But let us have a more detailed look at the implementation steps.

3.1 Source Inspection

To develop a properly operating web crawler, it is important to familiarize yourself in advance with the site structure and the available content.

Fig. 3 — Connection of the relevant entities

TWDS is a classic publication with many authors and a lot of articles. Thanks to an archive page it was easy to understand the page structure in detail (see Fig. 3). Fortunately, the authors were not only listed there but also provided with links that led to overview pages for these authors.

Fig. 4 — Page source code for the author list on the TWDS archive

The same HTML class was used consistently, so the author links could easily be identified (see Fig. 4). On the overview pages of the authors, I figured out that only the author's articles published on TWDS were listed; other articles published elsewhere by the author were not displayed. It was therefore not necessary to check whether a specific article belonged to the TWDS publication. Unfortunately, the HTML class for these article links was empty, so the links could not be identified by class. However, the links contained the complete URL and thus the word "towards".
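As a rough illustration of this link extraction, the following sketch collects the author links by their shared HTML class and the article links by the "towards" substring in their URLs, using BeautifulSoup. The class name "link--darken", the sample HTML snippets and the URLs are placeholders for demonstration only; the real selectors depend on the current page source.

from bs4 import BeautifulSoup

# Dummy HTML standing in for page source obtained via Selenium (driver.page_source)
archive_html = '<a class="link--darken" href="https://towardsdatascience.com/@some-author">Author</a>'
author_page_html = '<a href="https://towardsdatascience.com/some-article-1a2b3c">Article</a>'

# On the archive page, author links share one HTML class
archive_soup = BeautifulSoup(archive_html, "html.parser")
author_links = [a["href"] for a in archive_soup.find_all("a", class_="link--darken", href=True)]

# On an author's overview page, article links carry no usable class,
# but their URLs contain the word "towards"
author_soup = BeautifulSoup(author_page_html, "html.parser")
article_links = [a["href"] for a in author_soup.find_all("a", href=True) if "towards" in a["href"]]

print(author_links)
print(article_links)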
The "towards" pattern therefore made the identification of these article links just as unambiguous. However, another challenge occurred when examining the page: not all of the author's articles were displayed directly; when the website was scrolled down, further content was dynamically reloaded using JavaScript. To ensure completeness, this had to be taken into account in the development of the web crawler.

Fig. 5 — Example of the HTML source code from a TWDS article

Finally, I had to examine the structure of the individual articles for similarities and patterns in order to extract the relevant data fields. The required properties were author, URL, title, text, reading time, publishing date, tags, claps and the number of responses. As can be seen in Figure 5, the HTML source code poses some challenges. For example, the class names are seemingly dynamically generated and have only minor matches across articles. But there are also rays of hope, e.g. reading time, title, URL, and publishing date are standardized in the page header. The remaining content was reasonably easy to access.

3.2 Package Selection

At first, during development in Jupyter Notebooks, I was looking for Python packages I could use to fulfill all requirements. I quickly realized that with Scrapy, one of the most commonly used packages for web scraping, handling dynamically reloaded content would be difficult. After focusing on this requirement, I became aware of Selenium. Selenium is a framework for automated software testing of web applications and can interact with browsers, e.g. to scroll down pages, load the dynamic JavaScript content and receive the full HTML source code.

To work with the extracted HTML source code, I found the Python package BeautifulSoup4, which provides various methods to systematically search the HTML tree structure for relevant content. With these packages selected, I could fulfill all the requirements to develop the web crawler.

4.1 Development of a Python-based web crawler

During the development, I worked along the page structure shown in Figure 3, so I started with the extraction of the author list. I defined the URL of the archive page to be crawled and used it to start the Selenium Webdriver. The following snippet shows all required parts of the code to run the Selenium Webdriver.

# Import
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Define browser options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Hides the browser window

# Reference the local Chromedriver instance
chrome_path = r'/usr/local/bin/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)

# Run the Webdriver, save the page source and quit
driver.get("…")  # archive URL elided in the original text
page_source = driver.page_source
driver.quit()
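The dynamic reloading described above can be handled by letting Selenium scroll to the bottom of an author's overview page until the page height no longer grows. The following is a minimal sketch of this common scroll-to-bottom pattern, not necessarily the exact logic used in the project; the URL, the chromedriver path and the two-second wait are placeholder assumptions.

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver', options=chrome_options)
driver.get("https://example.com")  # placeholder for an author's overview page

# Scroll down repeatedly until the page height stops growing,
# i.e. no further articles are reloaded by JavaScript
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the JavaScript time to append more content
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

page_source = driver.page_source  # now contains all dynamically loaded articles
driver.quit()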
Intro to automation and web Crawling with Selenium – Medium
Learn how to use Selenium and Python to scrape and interact with any website.

In this in-depth tutorial series, you will learn how to use Selenium + Python to crawl and interact with almost any website. More specifically, you'll learn how to:

Make requests and select elements using CSS selectors and XPath — Tutorial Part 1
Login to any web platform — Tutorial Part 2
Pro tips and crawling in practice — Tutorial Part 3

Selenium is a Web Browser Automation Tool originally designed to automate web applications for testing purposes. It is now used for many other applications, such as automating web-based admin tasks and interacting with platforms which do not provide an API, as well as for web crawling.

There are many reasons to choose Selenium when crawling. Here are some of them:

Supports many languages: Python, Java, C#, PHP, Ruby…
Supports JavaScript: you can access more information on the page and simulate behaviours that are close to those of a human.
Can be integrated with Maven, Jenkins & Docker, so it is easy to productionise your scripts.

On the other side, Selenium has some drawbacks compared to regular (non-JS) crawlers like Scrapy, requests or urllib in Python. More specifically, it needs more resources, is slower, and is more difficult to scale. It is therefore advisable to use Selenium only if speed is not an issue, and to use it on the most complex sites to crawl.

Important note: Scraping is against some websites' terms of service. Please read a website's terms of service before scraping it.

In this tutorial, we will use Python 3.x. You can also use Python 2.7, but some parts of the code may require slight changes.

Install dependencies

First you will need to create your own virtual environment and install the Selenium Python module. If you need to install virtualenv, please follow the corresponding instructions first.

virtualenv selenium_example
source selenium_example/bin/activate
pip install selenium

Install Chrome Driver

Second, you need to install the Google Chrome Driver; download the latest version.

NB: Selenium also supports Firefox and Safari, but Chrome is the most popular among developers and the most widely used. Create a script and start importing the necessary packages.

Import the packages

Let's now load our essential dependencies for this tutorial!

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

The first line imports the Web Driver, and the second imports the Chrome Options. Selenium offers many options, such as:

The window size
Browsing in incognito mode
Using proxies

In this tutorial we will browse in incognito mode and set the window size to 1920x1080. You'll learn how to use proxies in the last part.

chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--window-size=1920x1080")

Create your instance

driver = webdriver.Chrome(chrome_options=chrome_options, executable_path="…")  # path to chromedriver truncated in the original
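Putting the pieces from this tutorial excerpt together, a minimal complete setup could look like the sketch below; the chromedriver path and the target URL are placeholders, and newer Selenium releases prefer the options= keyword over chrome_options=.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--window-size=1920x1080")

# Path and URL are placeholders; point them at your local chromedriver and target page
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver", options=chrome_options)
driver.get("https://example.com")
print(driver.title)
driver.quit()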
Frequently Asked Questions about Selenium web crawlers
Is Selenium a web crawler?
Selenium is a Web Browser Automation Tool originally designed to automate web applications for testing purposes. It is now used for many other applications, such as automating web-based admin tasks and interacting with platforms which do not provide an API, as well as for web crawling. (Jan 13, 2019)
Is Selenium good for web scraping?
Selenium is an open-source web-based automation tool. Selenium is primarily used for testing in the industry, but it can also be used for web scraping. We'll use the Chrome browser, but you can try any browser; it's almost the same. Now let us see how to use Selenium for web scraping. (Aug 30, 2020)
What is the difference between BeautifulSoup and Selenium?
Comparing Selenium vs BeautifulSoup shows that BeautifulSoup is more user-friendly, lets you learn faster, and makes it easier to start with smaller web scraping tasks. Selenium, on the other hand, is important when the target website has a lot of JavaScript elements in its code. (Feb 10, 2021)
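To illustrate the difference, a static page can often be scraped with just requests and BeautifulSoup, without any browser automation. A minimal sketch under that assumption, with a placeholder URL:

import requests
from bs4 import BeautifulSoup

# For static pages no browser is needed; the URL is a placeholder
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title and all link targets
print(soup.title.get_text())
print([a["href"] for a in soup.find_all("a", href=True)])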