Web Scraping With Java Selenium
Web scraping with Selenium – Towards Data Science
WEB SCRAPING SERIES
Hands-on with Selenium

Selenium is a portable framework for testing web applications. It is open-source software released under the Apache License 2.0 that runs on Windows, Linux and macOS. Despite testing being its major purpose, Selenium is also used as a web scraping tool. Without delving into all the components of Selenium, we shall focus on the single component that is useful for web scraping: WebDriver. Selenium WebDriver gives us the ability to control a web browser through a programming interface to create and execute test cases; in our case, we shall be using it for scraping data from websites. Selenium comes in handy when websites display content dynamically, i.e. use JavaScript to render content. Even though Scrapy is a powerful web scraping framework, it cannot execute JavaScript on its own and therefore struggles with these dynamic websites. My goal for this tutorial is to make you familiar with Selenium and carry out some basic web scraping using it.

Let us start by installing Selenium and a WebDriver. WebDrivers are available for several programming languages, including Python, Java, C#, Ruby, PHP, and Perl. The examples in this tutorial use Python; tutorials for the other languages are available on the internet.

This is the third part of a 4-part tutorial series on web scraping using Scrapy and Selenium. The other parts can be found at:
Part 1: Web scraping with Scrapy: Theoretical Understanding
Part 2: Web scraping with Scrapy: Practical Understanding
Part 4: Web scraping with Selenium & Scrapy

Installing Selenium
Installing Selenium on any Linux OS is easy. Just execute the following command in a terminal and Selenium will be installed:

pip install selenium

Installing WebDriver
Selenium officially provides WebDrivers for 5 web browsers. Here, we shall see the installation of the WebDriver for two of the most widely used browsers: Chrome and Firefox.

Installing Chromedriver for Chrome
First, we need to download the latest stable version of chromedriver from Chrome's official site. It comes as a zip file. All we need to do is extract it and put it on the executable path:

sudo mv chromedriver /usr/local/bin/

Installing Geckodriver for Firefox
Installing geckodriver for Firefox is even simpler, since it is maintained by Mozilla itself. All we need to do is execute the following line in a terminal, and you are ready to play around with Selenium and geckodriver:

sudo apt install firefox-geckodriver

There are two examples with increasing levels of complexity. The first is a simpler webpage-automation example: opening a page, typing into textboxes and pressing key(s). It showcases how a webpage can be controlled through Selenium from a program. The second is a more complex web scraping example involving mouse scrolling, mouse button clicks and navigating to other pages. The goal here is to make you feel confident enough to start web scraping with Selenium.
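Although this tutorial uses Python, the compilation as a whole is framed around Java, so it can help to confirm early that Selenium and a browser driver are wired up correctly from Java as well. The snippet below is only an illustrative sketch and is not part of the original article; it assumes chromedriver is on the PATH (for example in /usr/local/bin, as installed above) and uses example.com purely as a test URL.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SmokeTest {
    public static void main(String[] args) {
        // Assumes chromedriver is available on the PATH
        WebDriver driver = new ChromeDriver();
        try {
            // Load a page and print its title to confirm the setup works
            driver.get("https://example.com");
            System.out.println("Page title: " + driver.getTitle());
        } finally {
            // Always release the browser instance
            driver.quit();
        }
    }
}

If a browser session starts and a page title is printed, the setup is working and you can move on to the examples.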
Example 1 — Logging into Facebook using Selenium
Let us try out a simple automation task using Selenium and chromedriver as our training-wheel exercise. For this, we will try to log into a Facebook account; we are not performing any kind of data scraping. I am assuming that you have some knowledge of identifying the HTML tags used in a webpage with the browser's developer tools.

The following is a piece of Python code that opens a new Chrome browser, opens the Facebook main page, enters a username and password, and submits the login form:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

user_name = "Your E-mail"
password = "Your Password"

# Creating a webdriver instance
driver = webdriver.Chrome()    # For Chrome
# driver = webdriver.Firefox() # For Firefox

# Opening the Facebook main page
driver.get("https://www.facebook.com")

# Identifying the email and password textboxes
email = driver.find_element_by_id("email")
passwd = driver.find_element_by_id("pass")

# Sending user_name and password to the corresponding textboxes
email.send_keys(user_name)
passwd.send_keys(password)

# Sending a signal that the RETURN key has been pressed
passwd.send_keys(Keys.RETURN)

# driver.quit()

After executing this Python code, your Facebook homepage will open in a new Chrome browser window. Let us examine how this became possible.

It all starts with the creation of a webdriver instance for your browser. As I am using Chrome, I have used driver = webdriver.Chrome(). Then we open the Facebook webpage using driver.get(URL): when Python encounters driver.get(URL), it opens a new browser window and loads the webpage specified by the URL. Once the homepage is loaded, we identify the textboxes into which to type the e-mail and password using their HTML tags' id attributes. This is done using driver.find_element_by_id(). We send the username and password values for logging into Facebook using send_keys(), and then simulate the user's action of pressing the RETURN/ENTER key by sending its corresponding signal using send_keys(Keys.RETURN).

IMPORTANT NOTE: Any instance created in a program should be closed at the end of the program or after its purpose is served. So, whenever we create a webdriver instance, it has to be terminated using driver.quit(). If we do not terminate the opened instances, they start to use up RAM, which may impact the machine's performance and slow it down. In the above example, this termination has been commented out to show the output in a browser window; if terminated, the browser window would also be closed and the reader would not be able to see the output.
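For readers following the Java thread of this compilation, Example 1 translates almost line for line. The following is a sketch rather than the article's own code; it assumes the same element ids ("email" and "pass") that the Python version relies on.

import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class FacebookLogin {
    public static void main(String[] args) {
        String userName = "Your E-mail";
        String password = "Your Password";

        // Creating a webdriver instance (Chrome here; FirefoxDriver works the same way)
        WebDriver driver = new ChromeDriver();
        try {
            // Opening the Facebook main page
            driver.get("https://www.facebook.com");

            // Identifying the email and password textboxes by their id attributes
            WebElement email = driver.findElement(By.id("email"));
            WebElement passwd = driver.findElement(By.id("pass"));

            // Typing the credentials and pressing RETURN
            email.sendKeys(userName);
            passwd.sendKeys(password);
            passwd.sendKeys(Keys.RETURN);
        } finally {
            // Terminate the instance so it does not keep using RAM
            driver.quit();
        }
    }
}

As in the Python version, the important part is releasing the browser with driver.quit() once the instance has served its purpose.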
Example 2 — Scraping Pollution data from OpenAQ
This is a more complex example. OpenAQ is a non-profit organization that collects and shares air quality data that are open and can be accessed in many ways. This is evident from the site's robots.txt:

User-agent: *
Disallow:

Our goal here is to collect data on PM2.5 readings from all the countries listed on openaq.org. PM2.5 is particulate matter (PM) with a diameter of less than 2.5 micrometres, which is much smaller than the diameter of a human hair. If the reader is interested in knowing more about PM2.5, please follow this link.

The reason for choosing Selenium over Scrapy is that openaq.org uses React JS to render its data. If these were static webpages, Scrapy would scrape the data efficiently. To scrape data, we first need to analyze the website, manually navigate the pages, and note down the user interaction steps required to extract the data.

Understanding the layout
It is always better to scrape with as few webpage navigations as possible. The website has a locations webpage which can be used as a starting point for scraping. The filter-locations option on the left-side panel is used to filter out PM2.5 data for each country. The Results on the right-side panel show cards that open a new page when clicked to display PM2.5 and other data. A sample page containing PM2.5 data is shown below. From this page, we can extract the PM2.5 value, location, city, country, and the date and time of recording the PM2.5 value, using XPath or CSS selectors.

(Screenshot from openaq.org showing the PM2.5 value after clicking the location from the previous image.)

Similarly, the left-side panel can be used to filter out and collect the URLs of all the locations that contain PM2.5 data.

The following are the actions that we performed manually to collect the data:
1. From the left-side panel, select/click the checkbox of a country. We go through the countries one by one.
2. From the left-side panel, select/click the PM2.5 checkbox and wait for the cards to load in the right-side panel.
3. Each card, when clicked, opens a new webpage displaying PM2.5 and other data.

Steps needed to collect PM2.5 data
Based on the manual steps performed, data collection from openaq.org is broken down into 3 steps:
1. Collecting country names as displayed on the OpenAQ countries webpage. These are used to select the appropriate checkboxes while filtering.
2. Collecting the URLs that contain PM2.5 data for each country. Some countries have more than 20 PM2.5 readings collected from various locations; this requires further manipulation of the webpage, which is explained in the code section.
3. Opening the webpage of each individual URL and extracting the PM2.5 data.

Scraping PM2.5 data
Now that we have the steps needed, let us start to code. The example is divided into 3 functions, each performing the task corresponding to one of the aforementioned 3 steps. The Python code for this example can be found in my GitHub repository.

get_countries()
Instead of using the OpenAQ locations webpage, there is a countries webpage which displays all the countries at once. It is easier to extract the country names from this page.

(Screenshot from openaq.org showing the list of countries.)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json

def get_countries():
    countries_list = []

    # driver = webdriver.Chrome()  # To open a new browser window and navigate it
    # Use the headless option to avoid opening a new browser window
    options = webdriver.ChromeOptions()
    options.add_argument("headless")
    desired_capabilities = options.to_capabilities()
    driver = webdriver.Chrome(desired_capabilities=desired_capabilities)

    # Getting the webpage with the list of countries
    driver.get("https://openaq.org/#/countries")

    # Implicit wait
    driver.implicitly_wait(10)

    # Explicit wait
    wait = WebDriverWait(driver, 5)
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "card__title")))
    countries = driver.find_elements_by_class_name("card__title")
    for country in countries:
        countries_list.append(country.text)

    driver.quit()

    # Write countries_list to a JSON file
    with open("countries_list.json", "w") as f:
        json.dump(countries_list, f)

Let us understand how the code works. As always, the first step is to instantiate the webdriver. Here, instead of opening a new browser window, the webdriver is instantiated as a headless one. This way, a new browser window is not opened and the burden on RAM is reduced. The second step is to open the webpage containing the list of countries. The concept of a wait is used in the above code:

Implicit Wait: once created, it is alive until the WebDriver object dies, and it applies to all operations. It instructs the webdriver to wait a certain amount of time for elements to load on the webpage.
Explicit Wait: an intelligent wait that is confined to a particular web element, in this case the tag with class name "card__title". It is generally used along with ExpectedConditions.

The third step is to extract the country names using the tag with class name "card__title". Finally, the country names are written to a JSON file for persistence. Below is a glimpse of the JSON file:

["Afghanistan", "Algeria", "Andorra", "Antigua and Barbuda", … ]
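The two ideas introduced here, running the browser headless and combining an implicit wait with an explicit wait, carry over directly to Java. The sketch below is an editorial addition rather than part of the article; it reuses the OpenAQ countries URL and the card__title class name from the Python code above, and it uses the Selenium 3 style constructors (for example WebDriverWait(driver, seconds)) that the Java article later in this compilation also uses.

import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class CountryScraper {
    public static void main(String[] args) {
        // Headless mode: no browser window is opened, which reduces memory use
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://openaq.org/#/countries");

            // Implicit wait: applies to every findElement(s) call on this driver
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);

            // Explicit wait: targeted at one specific condition
            WebDriverWait wait = new WebDriverWait(driver, 5);
            wait.until(ExpectedConditions.presenceOfElementLocated(By.className("card__title")));

            // Collect and print the country names
            List<WebElement> countries = driver.findElements(By.className("card__title"));
            for (WebElement country : countries) {
                System.out.println(country.getText());
            }
        } finally {
            driver.quit();
        }
    }
}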
get_urls()
The next step after getting the list of countries is to get the URLs of every location that records PM2.5 data. To do this, we need to open the OpenAQ locations webpage and use the left-side panel to filter by country and by PM2.5. Once the filter is applied, the right-side panel is populated with cards linking to the individual locations that record PM2.5. We extract the URLs corresponding to each of these cards and eventually write them to a file that will be used in the next step of extracting PM2.5 data.

Some countries have more than 20 locations that record PM2.5 data. For example, Australia has 162 locations, Belgium has 69 locations and China has 1602 locations. For these countries, the right-side panel on the locations webpage is subdivided into pages, and it is imperative that we navigate through these pages to collect the URLs of all the locations. The code below has a while True: loop that performs this exact task of page navigation.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from logzero import logger
import selenium.common.exceptions as exception
import time
import json

def get_urls():
    # Load the countries list written by get_countries()
    with open("countries_list.json", "r") as f:
        countries_list = json.load(f)

    # driver = webdriver.Chrome()
    # Use the headless option to not open a new browser window
    options = webdriver.ChromeOptions()
    options.add_argument("headless")
    desired_capabilities = options.to_capabilities()
    driver = webdriver.Chrome(desired_capabilities=desired_capabilities)

    urls_final = []
    for country in countries_list:
        # Opening the locations webpage
        driver.get("https://openaq.org/#/locations")
        driver.implicitly_wait(5)
        urls = []

        # Scrolling down the country filter till the country is visible
        action = ActionChains(driver)
        action.move_to_element(driver.find_element_by_xpath("//span[contains(text(), " + '"' + country + '"' + ")]"))
        action.perform()

        # Identifying the country and PM2.5 checkboxes
        country_button = driver.find_element_by_xpath("//label[contains(@for, " + '"' + country + '"' + ")]")
        values_button = driver.find_element_by_xpath("//span[contains(text(), 'PM2.5')]")

        # Clicking the checkboxes
        country_button.click()
        time.sleep(2)
        values_button.click()
        time.sleep(2)

        while True:
            # Navigating subpages when there are more than 20 PM2.5 locations.
            # For example, Australia has 162 PM2.5 readings from 162 different
            # locations that are spread across 11 subpages.
            locations = driver.find_elements_by_xpath("//h1[@class='card__title']/a")
            for loc in locations:
                link = loc.get_attribute("href")
                urls.append(link)
            try:
                next_button = driver.find_element_by_xpath("//li[@class='next']")
                next_button.click()
            except exception.NoSuchElementException:
                logger.info(f"Last page reached for {country}")
                break

        logger.info(f"{country} has {len(urls)} PM2.5 URLs")
        urls_final.extend(urls)

    logger.info(f"Total PM2.5 URLs: {len(urls_final)}")
    driver.quit()

    # Write the URLs to a file
    with open("urls.json", "w") as f:
        json.dump(urls_final, f)

It is always a good practice to log the output of programs that tend to run longer than 5 minutes. For this purpose, the above code makes use of logzero. The output JSON file containing the URLs looks like:

["…", "…", "…", … ]
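The most reusable piece of get_urls() is the pagination pattern: keep harvesting cards and clicking the "next" button until it no longer exists. A rough Java equivalent of just that pattern is sketched below; it is not part of the article, the XPaths are simply carried over from the Python code above, and the method assumes the country and PM2.5 filters have already been applied on the page.

import java.util.ArrayList;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class Pagination {
    // Collects the link of every location card, walking through all result subpages
    static List<String> collectLocationUrls(WebDriver driver) {
        List<String> urls = new ArrayList<>();
        while (true) {
            // Grab the links of all cards on the current subpage
            List<WebElement> locations = driver.findElements(By.xpath("//h1[@class='card__title']/a"));
            for (WebElement loc : locations) {
                urls.add(loc.getAttribute("href"));
            }
            try {
                // Move to the next subpage, if there is one
                driver.findElement(By.xpath("//li[@class='next']")).click();
            } catch (NoSuchElementException e) {
                // No "next" button left: we reached the last subpage
                break;
            }
        }
        return urls;
    }
}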
get_pm_data()
The process of getting PM2.5 data from an individual location is a straightforward web scraping task: identify the HTML tag containing the data and extract it with text processing. That is what happens in the code provided below. The code extracts the country, city, location, PM2.5 value, URL of the location, and the date and time the PM2.5 value was recorded. Since there are over 5000 URLs to be opened, there would be a problem with RAM usage unless the RAM installed is over 64 GB. To make this program run on machines with a minimum of 8 GB of RAM, the webdriver is terminated and re-instantiated every 200 URLs.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from logzero import logger
import selenium.common.exceptions as exception
import time
import json

def get_pm_data():
    # Load the URLs list written by get_urls()
    with open("urls.json", "r") as f:
        urls = json.load(f)

    # Use the headless option to not open a new browser window
    options = webdriver.ChromeOptions()
    options.add_argument("headless")
    desired_capabilities = options.to_capabilities()
    driver = webdriver.Chrome(desired_capabilities=desired_capabilities)

    list_data_dict = []
    count = 0

    for i, url in enumerate(urls):
        data_dict = {}

        # Open the webpage corresponding to each URL
        driver.get(url)
        driver.implicitly_wait(10)
        time.sleep(2)

        try:
            # Extract Location and City
            loc = driver.find_element_by_xpath("//h1[@class='inpage__title']").text.split("\n")
            logger.info(f"loc: {loc}")
            location = loc[0]
            city_country = loc[1].replace("in ", "", 1).split(", ")
            city = city_country[0]
            country = city_country[1]
            data_dict["country"] = country
            data_dict["city"] = city
            data_dict["location"] = location

            pm = driver.find_element_by_xpath("//dt[text()='PM2.5']/following-sibling::dd[1]").text
            if pm is not None:
                # Extract the PM2.5 value, and the date and time of recording
                split = pm.split("µg/m³")
                pm = split[0]
                date_time = split[1].replace("at ", "").split(" ")
                date_pm = date_time[1]
                time_pm = date_time[2]
                data_dict["pm25"] = pm
                data_dict["url"] = url
                data_dict["date"] = date_pm
                data_dict["time"] = time_pm
                list_data_dict.append(data_dict)
                count += 1
        except exception.NoSuchElementException:
            # Logging the info of locations that do not have PM2.5 data for manual checking
            logger.info(f"{location} in {city}, {country} does not have PM2.5")

        # Terminating and re-instantiating the webdriver every 200 URLs to reduce the load on RAM
        if (i != 0) and (i % 200 == 0):
            driver.quit()
            driver = webdriver.Chrome(desired_capabilities=desired_capabilities)
            logger.info("Chromedriver restarted")

    # Write the extracted data into a JSON file
    with open("output.json", "w") as f:
        json.dump(list_data_dict, f)

    logger.info(f"Scraped {count} PM2.5 readings.")
    driver.quit()

The outcome of the program looks as shown below. The program has extracted PM2.5 values from 4114 individual locations. Imagine opening each of these webpages and extracting the data manually; it is times like this that make us appreciate the use of web scraping programs and bots.

[
  { "country": " Afghanistan", "city": "Kabul", "location": "US Diplomatic Post: Kabul", "pm25": "33", "url": "…", "date": "2020/07/31", "time": "11:00" },
  { "country": " Algeria", "city": "Algiers", "location": "US Diplomatic Post: Algiers", "pm25": "31", "url": "…", "date": "2020/07/31", "time": "08:30" },
  { "country": " Australia", "city": "Adelaide", "location": "CBD", "pm25": "9", "url": "…", "date": "2020/07/31", "time": "11:00" },
  …
]

I hope this tutorial has given you the confidence to start web scraping with Selenium. The complete code of the example is available in my GitHub repository. In the next tutorial, I shall show you how to integrate Selenium with Scrapy. Until then, good luck. Stay safe and happy learning!
Web Scraping 101 (Using Selenium for Java) – Gal Abramovitz
(Photo by rawpixel on Unsplash)

Web scraping is one of the most useful skills in today's digital world. Basically, it takes web browsing to the next level by automating everyday actions such as opening URLs, reading text and data, and clicking links. Even though web scraping is relatively easy to learn and execute, it's a powerful tool that you can use to collect existing data from websites and then easily manipulate, analyze and store it to your liking, or to automate workflows in web-based systems.

In this tutorial I'll show you, step by step, how to set up Selenium in IntelliJ and how to build a basic web scraper that can read data from a webpage.
(* The stages below are followed by matching screenshots.)

Stage 1: I found it easiest to use Selenium Standalone Server. Download the JAR file from the given link.
Stage 2: Open IntelliJ IDEA and create a new project.
Stage 3: Right-click one of your project's directories and click Open Module Settings.
Stage 4: In the Modules section click the Dependencies tab, then click the "+" button and choose JARs or directories…
Stage 5: Choose the JAR file you downloaded in stage 1, then click OK.
Stage 6: The External Libraries should now contain the JAR file.
Stage 7: Download Geckodriver (choose the file according to your operating system) and put the executable file in the project's directory.

That's it – you're all set! You can take a look at the self-explanatory screenshots and move on to the second part of the tutorial.

(Screenshots — Stage 3: click Open Module Settings; Stage 4: in the Modules section click the Dependencies tab, then click the "+" button and choose JARs or directories…; Stage 6: the External Libraries should now contain the JAR file; Stage 7: your project's directory should look like this.)

After we've set up our working environment, let's use Selenium to build a basic web scraper. We'll use the HTML structure of a webpage and read specific data from it (we'll actually take advantage of the CSS code, but the principles are the same).

Preparing for web scraping
Before we actually start to scrape, we need to understand what we're looking for. This part will be much easier for you if you can read HTML.

In this tutorial we'll scrape my website and extract the list of languages I use. It might seem a tough task at first, but a glance into the paragraph's HTML code will reveal a surprising structure:
<span>Java</span>, <span>JavaScript</span>, <span>jQuery</span>, <span>HTML</span> and <span>CSS</span>. My university projects also use <span>C</span>, <span>C++</span>, <span>Assembly</span>, <span>SQL</span> and <span>Python</span>. As I'm highly aware to the elegance of my products, I keep learning and pushing myself towards cleaner code and more beautiful UI.
You can tell immediately that each language name is wrapped in a separate <span> tag. We'll use this fact later in our scraper's code. I obviously already know my own website, but unravelling each page's code is the first challenge that you'll need to solve. In most cases there is an obvious logic to the HTML structure that we can take advantage of. Understand the HTML logic, and your life will get much easier.

The next step is to plan the scraping workflow. I suggest you go through your workflow manually a couple of times before writing your web scraper. In this basic example, this is the workflow we're going to implement:
1. Open a new browser window.
2. Navigate to my website.
3. Click the menu option "ABOUT".
4. Copy the language names, according to the logic we've defined.

Selenium Building Blocks
I like to think about a (basic) web scraper as a synchronous program that mainly uses two object instances:
A WebDriver instance, which is our code's connection to the browser's window: it allows us to read its code and simulate user actions. Each WebDriver instance corresponds to an individual browser window. (In this tutorial I'll use a FirefoxDriver, just because I usually use Chrome and I like to have a separate browser for my web scraping applications.)
A WebDriverWait instance, which allows us to synchronize our actions (e.g. click a button only after it loads and is clickable).
This might seem a bit abstract, but don't worry – in a moment we'll use both of them in practice.

Finally – let's write some code! (The full code is provided in this repository.) We'll start by initializing a FirefoxDriver and a WebDriverWait:

FirefoxDriver driver = new FirefoxDriver();
WebDriverWait wait = new WebDriverWait(driver, 30);

Note that WebDriverWait remains the same regardless of the browser you choose, and that it's connected to the FirefoxDriver instance. The second argument in its constructor represents the maximum amount of time that our program should wait for an action (e.g. a page load).

After initialization, we have one new Firefox window that's controlled by our program. Now let's navigate to a specific URL:

driver.navigate().to("…"); // the address of the website we're scraping

Now we want to click an element. We'll need to specify a one-to-one address that points to this exact element. There are a few kinds of addresses we can use, but I find XPath to be the most convenient. In order to find the button's XPath I'm going to use Chrome's Developer Tools: right-click the element, then click Inspect. The page's HTML code will be opened and the inspected element will be highlighted. Right-click it, click Copy and then Copy XPath. Now you have the element's XPath, which can be used as its address! Let's save it as a String:

String aboutButtonXpath = "//*[@id=\"about\"]/div/a";

The code is executed regardless of the browser's state, so we need to make sure the button element is loaded before we try to click it. As mentioned, we can use our wait instance just for that (one of the reasons I love Selenium is that the syntax is self-explanatory):

wait.until(ExpectedConditions.elementToBeClickable(By.xpath(aboutButtonXpath)));

Our WebDriverWait instance blocks our thread until the button element is clickable, and only then do we click it:

driver.findElement(By.xpath(aboutButtonXpath)).click();

In the same way,
we make sure the relevant paragraph is loaded before trying to copy its data:

String languagesParagraphXpath = "//*[@id=\"page1\"]/div[2]/div[5]";
wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(languagesParagraphXpath)));

Now we can use the logic that we've defined in the workflow. Using the findElements() method, we'll retrieve a list of WebElements, each representing a <span> that contains a language name:

List<WebElement> languageElements = driver.findElements(By.xpath(languagesParagraphXpath + "/span")); // the <span> children of the paragraph

Note the plural in the method's name, which suggests that the expected return value is a List.

Once we're finished scraping, I chose to just print the language names, but you could store or process them any way you like. We're done interacting with the webpage, so we can close the browser window:

driver.close();
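Putting the pieces of this tutorial together, an end-to-end version of the scraper might look roughly like the sketch below. It is an illustrative assembly rather than the repository's actual code: the URL is a placeholder, and the XPaths are the ones discussed above, which are specific to the author's website.

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class LanguagesScraper {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        WebDriverWait wait = new WebDriverWait(driver, 30);
        try {
            // 1. Open a new browser window and navigate to the website (placeholder URL)
            driver.navigate().to("https://www.example.com");

            // 2. Click the "ABOUT" menu option once it is clickable (placeholder XPath)
            String aboutButtonXpath = "//*[@id=\"about\"]/div/a";
            wait.until(ExpectedConditions.elementToBeClickable(By.xpath(aboutButtonXpath)));
            driver.findElement(By.xpath(aboutButtonXpath)).click();

            // 3. Wait for the paragraph that lists the languages (placeholder XPath)
            String languagesParagraphXpath = "//*[@id=\"page1\"]/div[2]/div[5]";
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(languagesParagraphXpath)));

            // 4. Copy the language names: each one sits in its own <span>
            List<WebElement> languages = driver.findElements(By.xpath(languagesParagraphXpath + "/span"));
            for (WebElement language : languages) {
                System.out.println(language.getText());
            }
        } finally {
            // Release the browser when we are done
            driver.quit();
        }
    }
}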