• November 21, 2024

Web Scraping using Selenium and Python – ScrapingBee


… some stuff

Super title



h1 = nd_element_by_name(‘h1’)
h1 = nd_element_by_class_name(‘someclass’)
h1 = nd_element_by_xpath(‘//h1’)
h1 = nd_element_by_id(‘greatID’)
All these methods also have find_elements (note the plural) to return a list of elements.
For example, to get all anchors on a page, use the following:
all_links = nd_elements_by_tag_name(‘a’)
Some elements aren’t easily accessible with an ID or a simple class, and that’s when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It’s a powerful way to extract any element on a page, based on it’s absolute position on the DOM, or relative to another element.
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements, here are the most useful:
Accessing the text of the element with the property
Clicking on the element with ()
Accessing an attribute with t_attribute(‘class’)
Sending text to an input with: nd_keys(‘mypassword’)
There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.
It can be interesting to avoid honeypots (like filling hidden inputs).
Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:

This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random value, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.
That’s a classic honeypot.
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
Go to the login page using ()
Select the username input using nd_element_by_* and then nd_keys() to send text to the input
Follow the same process with the password input
Click on the login button using ()
Should be easy right? Let’s see the code:
login = nd_element_by_xpath(“//input”). send_keys(USERNAME)
password = nd_element_by_xpath(“//input[@type=’password’]”). send_keys(PASSWORD)
submit = nd_element_by_xpath(“//input[@value=’login’]”)()
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
Check for an error message (like “Wrong password”)
Check for one element on the page that is only displayed once logged in.
So, we’re going to check for the logout button. The logout button has the ID “logout” (easy)!
We can’t just check if the element is None because all of the find_element_by_* raise an exception if the element is not found in the DOM.
So we have to use a try/except block and catch the NoSuchElementException exception:
# dont forget from import NoSuchElementException
try:
logout_button = nd_element_by_id(“logout”)
print(‘Successfully logged in’)
except NoSuchElementException:
print(‘Incorrect login/password’)
We could easily take a screenshot using:
ve_screenshot(”)
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly.
Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
In our Hacker News case it’s simple and we don’t have to worry about these issues.
If you need to make screenshots at scale, feel free to try our new Screenshot API here.
Waiting for an element to be present
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
Use a (ARBITRARY_TIME) before taking the screenshot.
Use a WebDriverWait object.
If you use a () you will probably use an arbitrary value. The problem is, you’re either waiting for too long or not enough.
Also the website can load slowly on your local wifi internet connection, but will be 10 times faster on your cloud server.
With the WebDriverWait method you will wait the exact amount of time necessary for your element/data to be loaded.
element = WebDriverWait(driver, 5)(
esence_of_element_located((, “mySuperId”)))
finally:
This will wait five seconds for an element located by the ID “mySuperId” to be loaded.
There are many other interesting expected conditions like:
element_to_be_clickable
text_to_be_present_in_element
You can find more information about this in the Selenium documentation
Executing Javascript
Sometimes, you may need to execute some Javascript on the page. For example, let’s say you want to take a screenshot of some information, but you first need to scroll a bit to see it.
You can easily do this with Selenium:
javaScript = “rollBy(0, 1000);”
driver. execute_script(javaScript)
Using a proxy with Selenium Wire
Unfortunately, Selenium proxy handling is quite basic. For example, it can’t handle proxy with authentication out of the box.
To solve this issue, you need to use Selenium Wire.
This package extends Selenium’s bindings and gives you access to all the underlying requests made by the browser.
If you need to use Selenium with a proxy with authentication this is the package you need.
pip install selenium-wire
This code snippet shows you how to quickly use your headless browser behind a proxy.
# Install the Python selenium-wire library:
# pip install selenium-wire
from seleniumwire import webdriver
proxy_username = “USER_NAME”
proxy_password = “PASSWORD”
proxy_url = ”
proxy_port = 8886
options = {
“proxy”: {
“”: f”{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}”,
“verify_ssl”: False, }, }
URL = ”
driver = (
executable_path=”YOUR-CHROME-EXECUTABLE-PATH”,
seleniumwire_options=options, )
(URL)
Blocking images and JavaScript
With Selenium, by using the correct Chrome options, you can block some requests from being made.
This can be useful if you need to speed up your scrapers or reduce your bandwidth usage.
To do this, you need to launch Chrome with the below options:
chrome_options = romeOptions()
### This blocks images and javascript requests
chrome_prefs = {
“fault_content_setting_values”: {
“images”: 2,
“javascript”: 2, }}
chrome_options. experimental_options[“prefs”] = chrome_prefs
###
chrome_options=chrome_options, )
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don’t hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn’t have an API, it’s maybe* a good idea to automate it with Selenium, just don’t forget this xkcd:” alt=”python headless browser scraping,scraping dynamic web pages python selenium,web scraping using selenium python,python selenium chrome headless not working,web scraping using selenium python github,python web scraping selenium login,python selenium headless,web scraping with python: beautifulsoup, requests & selenium” title=”python headless browser scraping,scraping dynamic web pages python selenium,web scraping using selenium python,python selenium chrome headless not working,web scraping using selenium python github,python web scraping selenium login,python selenium headless,web scraping with python: beautifulsoup, requests & selenium” />

python headless browser scraping,scraping dynamic web pages python selenium,web scraping using selenium python,python selenium chrome headless not working,web scraping using selenium python github,python web scraping selenium login,python selenium headless,web scraping with python: beautifulsoup, requests & selenium

= 0): # URL to link to transactions
if isFirst == 1: # already expanded +
isFirst = 0
else:
(url) # collapsed +, so expand
# Find closest element to URL element with correct class to get tran type nd_element_by_xpath(“//*[contains(@href, ‘/retail/transaction/results/’)]/following::td[@class=’txt_75b_lmnw_T1R10B1′]”)
# Get transaction status
status = nd_element_by_class_name(‘txt_70b_lmnw_t1r10b1’)
# Add to count if transaction found
if (tran_type in [‘Move In’, ‘Move Out’, ‘Switch’]) and
(status == “Complete”):
thisCount += 1
In the above code, the fields I retrieved were the transaction type and the status, then added to a count to determine how many transactions fit the rules that were specified. However, I could have retrieved other fields within the transaction detail, like date and time, subtype, etc.
For this project, the count was returned back to a calling application. However, it and other scraped data could have been stored in a flat file or a database as well.
Additional Possible Roadblocks and Solutions
Numerous other obstacles might be presented while scraping modern websites with your own browser instance, but most can be resolved. Here are a few:
Trying to find something before it appears
While browsing yourself, how often do you find that you are waiting for a page to come up, sometimes for many seconds? Well, the same can occur while navigating programmatically. You look for a class or other element – and it’s not there!
Luckily, Selenium has the ability to wait until it sees a certain element, and can timeout if the element doesn’t appear, like so:
element = WebDriverWait(driver, 10). until(esence_of_element_located((, “theFirstLabel”)))
Getting through a Captcha
Some sites employ Captcha or similar to prevent unwanted robots (which they might consider you). This can put a damper on web scraping and slow it way down.
For simple prompts (like “what’s 2 + 3? ”), these can generally be read and figured out easily. However, for more advanced barriers, there are libraries that can help try to crack it. Some examples are 2Captcha, Death by Captcha, and Bypass Captcha.
Website structural changes
Websites are meant to change – and they often do. That’s why when writing a scraping script, it’s best to keep this in mind. You’ll want to think about which methods you’ll use to find the data, and which not to use. Consider partial matching techniques, rather than trying to match a whole phrase. For example, a website might change a message from “No records found” to “No records located” – but if your match is on “No records, ” you should be okay. Also, consider whether to match on XPATH, ID, name, link text, tag or class name, or CSS selector – and which is least likely to change.
Summary: Python and Selenium
This was a brief demonstration to show that almost any website can be scraped, no matter what technologies are used and what complexities are involved. Basically, if you can browse the site yourself, it generally can be scraped.
Now, as a caveat, it does not mean that every website should be scraped. Some have legitimate restrictions in place, and there have been numerous court cases deciding the legality of scraping certain sites. On the other hand, some sites welcome and encourage data to be retrieved from their website and in some cases provide an API to make things easier.
Either way, it’s best to check with the terms and conditions before starting any project. But if you do go ahead, be assured that you can get the job done.
Recommended Resources for Complex Web Scraping:
Advanced Python Web Scraping: Best Practices & Workarounds
Scalable do-it-yourself scraping: How to build and run scrapers on a large scale” alt=”” title=”” />

Frequently Asked Questions about Not the answer you’re looking for? Browse other questions tagged python python-3. x selenium google-chrome selenium-webdriver or ask your own question.”

Leave a Reply

Your email address will not be published. Required fields are marked *