Headless Browser Web Scraping
Web Scraping with a Headless Browser: A Puppeteer Tutorial
In this article, we’ll see how easy it is to perform web scraping (web automation) with the somewhat non-traditional method of using a headless browser.
What Is a Headless Browser and Why Is It Needed?
The last few years have seen the web evolve from simplistic websites built with bare HTML and CSS. Now there are much more interactive web apps with beautiful UIs, which are often built with frameworks such as Angular or React. In other words, nowadays JavaScript rules the web, including almost everything you interact with on websites.
For our purposes, JavaScript is a client-side language. The server returns JavaScript files or scripts injected into an HTML response, and the browser processes them. This is a problem if we are doing web scraping or web automation, because more often than not, the content we'd like to see or scrape is actually rendered by JavaScript code and is not accessible in the raw HTML response that the server delivers.
As we mentioned above, browsers do know how to process the JavaScript and render beautiful web pages. Now, what if we could leverage this functionality for our scraping needs and had a way to control browsers programmatically? That’s exactly where headless browser automation steps in!
Headless? Excuse me? Yes, this just means there’s no graphical user interface (GUI). Instead of interacting with visual elements the way you normally would—for example with a mouse or touch device—you automate use cases with a command-line interface (CLI).
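For instance, recent versions of Chrome can already be driven headlessly straight from the command line; this one-liner (the URL is just a placeholder) writes a screenshot.png of the rendered page to the current directory:
google-chrome --headless --disable-gpu --screenshot https://example.com
In practice, though, you'll usually want to control the browser from code through a library rather than raw CLI flags, which is exactly where the tools below come in.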
Headless Chrome and Puppeteer
There are many web scraping tools that can be used for headless browsing, such as headless Firefox driven by Selenium. But today we'll be exploring headless Chrome via Puppeteer, as it's a relatively new player, released at the start of 2018. Editor's note: It's worth mentioning Intoli's Remote Browser, another new player, but that will have to be a subject for another article.
What exactly is Puppeteer? It’s a library which provides a high-level API to control headless Chrome or Chromium or to interact with the DevTools protocol. It’s maintained by the Chrome DevTools team and an awesome open-source community.
Enough talking—let’s jump into the code and explore the world of how to automate web scraping using Puppeteer’s headless browsing!
Preparing the Environment
First of all, you'll need to have Node.js 8+ installed on your machine. You can install it from the official website, or if you're a CLI lover like me and like to work on Ubuntu, follow these commands:
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs
You’ll also need some packages that may or may not be available on your system. Just to be safe, try to install those:
sudo apt-get install -yq --no-install-recommends libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 libnss3
Setting Up Headless Chrome and Puppeteer
I’d recommend installing Puppeteer with npm, as it’ll also include the stable up-to-date Chromium version that is guaranteed to work with the library.
Run this command in your project root directory:
npm i puppeteer --save
Note: This might take a while as Puppeteer will need to download and install Chromium in the background.
Okay, now that we are all set and configured, let the fun begin!
Using Puppeteer API for Automated Web Scraping
Let’s start our Puppeteer tutorial with a basic example. We’ll write a script that will cause our headless browser to take a screenshot of a website of our choice.
Create a new file in your project directory (let's call it screenshot.js) and open it in your favorite code editor.
First, let’s import the Puppeteer library in your script:
```js
const puppeteer = require('puppeteer');
```
Next up, let’s take the URL from command-line arguments:
```js
const url = process.argv[2];

if (!url) {
    throw "Please provide a URL as the first argument";
}
```
Now, we need to keep in mind that Puppeteer is a promise-based library: It performs asynchronous calls to the headless Chrome instance under the hood. Let’s keep the code clean by using async/await. For that, we need to define an async function first and put all the Puppeteer code in there:
```js
async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});

    browser.close();
}

run();
```
Altogether, the final code looks like this:

```js
const puppeteer = require('puppeteer');

const url = process.argv[2];

if (!url) {
    throw "Please provide a URL as the first argument";
}

async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});

    browser.close();
}

run();
```
You can run it by executing the following command in the root directory of your project:
node screenshot.js https://github.com
Wait a second, and boom! Our headless browser just created a file named screenshot.png, and you can see the GitHub homepage rendered in it. Great, we have a working Chrome web scraper!
Let’s stop for a minute and explore what happens in our run() function above.
First, we launch a new headless browser instance, then we open a new page (tab) and navigate to the URL provided in the command-line argument. Lastly, we use Puppeteer’s built-in method for taking a screenshot, and we only need to provide the path where it should be saved. We also need to make sure to close the headless browser after we are done with our automation.
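While developing, it can help to watch what the browser is actually doing. Puppeteer lets you turn off headless mode and slow the automation down via launch options; here is a minimal sketch (these are standard Puppeteer launch options, not part of the example above):

```js
const browser = await puppeteer.launch({
    headless: false, // open a visible browser window instead of running headless
    slowMo: 250      // slow each Puppeteer operation down by 250 ms so you can follow along
});
```

Once everything works, switch back to the default headless launch.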
Now that we’ve covered the basics, let’s move on to something a bit more complex.
A Second Puppeteer Scraping Example
For the next part of our Puppeteer tutorial, let's say we want to scrape the newest articles from Hacker News.
Create a new file (let's call it hn-scraper.js) and paste in the following code snippet:
```js
const puppeteer = require('puppeteer');

function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();

            await page.goto("https://news.ycombinator.com/");

            let urls = await page.evaluate(() => {
                let results = [];
                let items = document.querySelectorAll('a.storylink');
                items.forEach((item) => {
                    results.push({
                        url: item.getAttribute('href'),
                        text: item.innerText,
                    });
                });
                return results;
            });

            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run().then(console.log).catch(console.error);
```
Okay, there’s a bit more going on here compared with the previous example.
The first thing you might notice is that the run() function now returns a promise so the async prefix has moved to the promise function’s definition.
We’ve also wrapped all of our code in a try-catch block so that we can handle any errors that cause our promise to be rejected.
And finally, we’re using Puppeteer’s built-in method called evaluate(). This method lets us run custom JavaScript code as if we were executing it in the DevTools console. Anything returned from that function gets resolved by the promise. This method is very handy when it comes to scraping information or performing custom actions.
The code passed to the evaluate() method is pretty basic JavaScript that builds an array of objects, each having url and text fields that represent the story URLs we see on Hacker News.
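Incidentally, evaluate() can also take arguments passed in from your Node.js code, which is handy for reusing selectors or other values inside the page context. A quick sketch (the selector here is just the one assumed in the example above):

```js
// Pass values from Node.js into the page context as extra arguments.
const storyCount = await page.evaluate((selector) => {
    return document.querySelectorAll(selector).length;
}, 'a.storylink');

console.log(`Found ${storyCount} stories on the current page`);
```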
The output of the script looks something like this (but with 30 entries, originally):
```js
[ { url: '',
    text: 'Bias detectives: the researchers striving to make algorithms fair' },
  { url: '',
    text: 'Mino Games Is Hiring Programmers in Montreal' },
  { url: '',
    text: "A Beginner's Guide to Firewalling with pf" },
  // ...
  { url: '',
    text: 'ChaCha20 and Poly1305 for IETF Protocols' } ]
```
Pretty neat, I’d say!
Okay, let’s move forward. We only had 30 items returned, while there are many more available—they are just on other pages. We need to click on the “More” button to load the next page of results.
Let's modify our script a bit to add support for pagination:
```js
function run (pagesToScrape) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!pagesToScrape) {
                pagesToScrape = 1;
            }
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");
            let currentPage = 1;
            let urls = [];
            while (currentPage <= pagesToScrape) {
                let newUrls = await page.evaluate(() => {
                    // ...same scraping logic as in the previous example, building `results`...
                    return results;
                });
                urls = urls.concat(newUrls);
                if (currentPage < pagesToScrape) {
                    await Promise.all([
                        await page.click('a.morelink'),
                        await page.waitForSelector('a.storylink')
                    ])
                }
                currentPage++;
            }
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run(5).then(console.log).catch(console.error);
```
Let’s review what we did here:
We added a single argument called pagesToScrape to our main run() function. We’ll use this to limit how many pages our script will scrape.
There is one more new variable named currentPage, which represents the number of the page of results we are currently looking at. It's set to 1 initially. We also wrapped our evaluate() function in a while loop, so that it keeps running as long as currentPage is less than or equal to pagesToScrape.
We added the block for moving to a new page and waiting for the page to load before restarting the while loop.
You'll notice that we used the click() method to have the headless browser click on the "More" button. We also used the waitForSelector() method to make sure our logic is paused until the page contents are loaded.
Both of those are high-level Puppeteer API methods ready to use out-of-the-box.
One of the problems you’ll probably encounter during scraping with Puppeteer is waiting for a page to load. Hacker News has a relatively simple structure and it was fairly easy to wait for its page load completion. For more complex use cases, Puppeteer offers a wide range of built-in functionality, which you can explore in the API documentation on GitHub.
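For example, here is a hedged sketch using standard Puppeteer methods (the a.some-link selector is hypothetical) of waiting either for a full navigation to settle or for an arbitrary condition inside the page:

```js
// Wait for a click-triggered navigation to fully settle before continuing.
await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
    page.click('a.some-link') // hypothetical selector
]);

// Or pause until a condition evaluated inside the page becomes true.
await page.waitForFunction(() => document.querySelectorAll('a.storylink').length > 30);
```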
This is all pretty cool, but our Puppeteer tutorial hasn't covered optimization yet. Let's see how we can make Puppeteer run faster.
Optimizing Our Puppeteer Script
The general idea is to not let the headless browser do any extra work. This might include loading images, applying CSS rules, firing XHR requests, etc.
As with other tools, optimization of Puppeteer depends on the exact use case, so keep in mind that some of these ideas might not be suitable for your project. For instance, if we had avoided loading images in our first example, our screenshot might not have looked how we wanted.
Anyway, these optimizations can be accomplished either by caching the assets on the first request, or canceling the HTTP requests outright as they are initiated by the website.
Let’s see how caching works first.
You should be aware that when you launch a new headless browser instance, Puppeteer creates a temporary directory for its profile. It is removed when the browser is closed and is not available for use when you fire up a new instance—thus all the images, CSS, cookies, and other objects stored will not be accessible anymore.
We can force Puppeteer to use a custom path for storing data like cookies and cache, which will be reused every time we run it again—until they expire or are manually deleted.
```js
const browser = await puppeteer.launch({
    userDataDir: './data',
});
```
This should give us a nice bump in performance, as lots of CSS and images will be cached in the data directory upon the first request, and Chrome won’t need to download them again and again.
However, those assets will still be used when rendering the page. Since we're only scraping Hacker News article listings, we don't really need to worry about any visuals, including the images. We only care about the bare HTML output, so let's try to block every request we can.
Luckily, Puppeteer is pretty cool to work with, in this case, because it comes with support for custom hooks. We can provide an interceptor on every request and cancel the ones we don’t really need.
The interceptor can be defined in the following way:
```js
await page.setRequestInterception(true);

page.on('request', (request) => {
    if (request.resourceType() === 'document') {
        request.continue();
    } else {
        request.abort();
    }
});
```
As you can see, we have full control over the requests that get initiated. We can write custom logic to allow or abort specific requests based on their resourceType. We also have access to lots of other data, like request.url(), so we can block only specific URLs if we want.
In the above example, we only allow requests with the resource type of “document” to get through our filter, meaning that we will block all images, CSS, and everything else besides the original HTML response.
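For instance, a hedged sketch of blocking by URL instead of resource type looks almost the same (analytics.example.com is just a placeholder domain):

```js
page.on('request', (request) => {
    // Abort requests to a (hypothetical) analytics domain, let everything else through.
    if (request.url().includes('analytics.example.com')) {
        request.abort();
    } else {
        request.continue();
    }
});
```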
Here's our final code, which simply combines the pagination script with the request interception setup above; the selectors (a.storylink and a.morelink) and the waitForSelector() calls stay exactly the same.
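A sketch of how that combined script plausibly looks, assuming the snippets above are merged as described (the selectors, the Hacker News URL, and the userDataDir path are carried over from the earlier examples):

```js
const puppeteer = require('puppeteer');

function run (pagesToScrape) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!pagesToScrape) {
                pagesToScrape = 1;
            }
            const browser = await puppeteer.launch({ userDataDir: './data' });
            const page = await browser.newPage();

            // Block everything except the main HTML document.
            await page.setRequestInterception(true);
            page.on('request', (request) => {
                if (request.resourceType() === 'document') {
                    request.continue();
                } else {
                    request.abort();
                }
            });

            await page.goto("https://news.ycombinator.com/");
            await page.waitForSelector('a.storylink');

            let currentPage = 1;
            let urls = [];
            while (currentPage <= pagesToScrape) {
                let newUrls = await page.evaluate(() => {
                    let results = [];
                    document.querySelectorAll('a.storylink').forEach((item) => {
                        results.push({ url: item.getAttribute('href'), text: item.innerText });
                    });
                    return results;
                });
                urls = urls.concat(newUrls);
                if (currentPage < pagesToScrape) {
                    await Promise.all([
                        page.click('a.morelink'),
                        page.waitForSelector('a.storylink')
                    ]);
                }
                currentPage++;
            }

            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run(5).then(console.log).catch(console.error);
```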
Stay Safe with Rate Limits
Headless browsers are very powerful tools. They’re able to perform almost any kind of web automation task, and Puppeteer makes this even easier. Despite all the possibilities, we must comply with a website’s terms of service to make sure we don’t abuse the system.
Since this aspect is more architecture-related, I won’t cover this in depth in this Puppeteer tutorial. That said, the most basic way to slow down a Puppeteer script is to add a sleep command to it:
```js
await page.waitFor(5000);
```
This statement will force your script to sleep for five seconds (5000 ms). You can put this anywhere before browser.close().
Just like limiting your use of third-party services, there are lots of other more robust ways to control your usage of Puppeteer. One example would be building a queue system with a limited number of workers. Every time you want to use Puppeteer, you’d push a new task into the queue, but there would only be a limited number of workers able to work on the tasks in it. This is a fairly common practice when dealing with third-party API rate limits and can be applied to Puppeteer web data scraping as well.
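As a rough illustration of that idea (a hand-rolled sketch, not a production-ready queue and not something from the original article), limiting concurrency can be as simple as a counter and a backlog of pending tasks:

```js
// Minimal concurrency limiter: at most `limit` scraping tasks run at once.
function createQueue(limit) {
    let active = 0;
    const backlog = [];

    const next = () => {
        if (active >= limit || backlog.length === 0) return;
        active++;
        const { task, resolve, reject } = backlog.shift();
        task()
            .then(resolve, reject)
            .finally(() => { active--; next(); });
    };

    return (task) => new Promise((resolve, reject) => {
        backlog.push({ task, resolve, reject });
        next();
    });
}

// Usage: allow at most two Puppeteer jobs at a time.
const enqueue = createQueue(2);
enqueue(() => run(5)).then(console.log).catch(console.error);
```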
Puppeteer’s Place in the Fast-moving Web
In this Puppeteer tutorial, I’ve demonstrated its basic functionality as a web-scraping tool. However, it has much wider use cases, including headless browser testing, PDF generation, and performance monitoring, among many others.
Web technologies are moving forward fast. Some websites are so dependent on JavaScript rendering that it’s become nearly impossible to execute simple HTTP requests to scrape them or perform some sort of automation. Luckily, headless browsers are becoming more and more accessible to handle all of our automation needs, thanks to projects like Puppeteer and the awesome teams behind them!
Headless Browser and scraping – solutions [closed] – Stack Overflow
I'm trying to put together a list of possible solutions for browser automation test suites and headless browser platforms capable of scraping.
BROWSER TESTING / SCRAPING:
Selenium – polyglot flagship in browser automation, bindings for Python, Ruby, JavaScript, C#, Haskell and more, IDE for Firefox (as an extension) for faster test deployment. Can act as a Server and has tons of features.
JAVASCRIPT
PhantomJS – JavaScript, headless testing with screen capture and automation, uses WebKit. As of version 1.8, Selenium's WebDriver API is implemented, so you can use any WebDriver binding and tests will be compatible with Selenium
SlimerJS – similar to PhantomJS, uses Gecko (Firefox) instead of WebKit
CasperJS – JavaScript, built on both PhantomJS and SlimerJS, has extra features
Ghost Driver – JavaScript implementation of the WebDriver Wire Protocol for PhantomJS.
new PhantomCSS – CSS regression testing. A CasperJS module for automating visual regression testing with PhantomJS and Resemble.js
new WebdriverCSS – plugin for WebdriverIO for automating visual regression testing
new PhantomFlow – Describe and visualize user flows through tests. An experimental approach to Web user interface testing.
new trifleJS – ports the PhantomJS API to use the Internet Explorer engine.
new CasperJS IDE (commercial)
Node-phantom – bridges the gap between PhantomJS and Node.js
WebDriverJs – Selenium WebDriver bindings for Node.js, by the Selenium team
– node module for WebDriver/Selenium 2
yiewd – wrapper using latest Harmony generators! Get rid of the callback pyramid with yield
ZombieJs – Insanely fast, headless full-stack testing using Node.js
NightwatchJs – Node JS based testing solution using Selenium Webdriver
Chimera – can do everything that PhantomJS does, but in a full JS environment
– Automated cross browser testing with JavaScript through Selenium Webdriver
– better implementation of WebDriver bindings with predefined 50+ actions
Nightmare – Electron bridge with a high-level API.
jsdom – Tailored towards web scraping. A very lightweight DOM implemented in Node.js; it supports pages with JavaScript.
new Puppeteer – Node library which provides a high-level API to control Chrome or Chromium. Puppeteer runs headless by default.
WEB SCRAPING / MINING
Scrapy – Python, mainly a scraper/miner – fast, well documented, and can be linked with Django Dynamic Scraper for nice mining deployments, or Scrapy Cloud for PaaS (server-less) deployment; works in the terminal or as a stand-alone server process, can be used with Celery, built on top of Twisted
Snailer – Node.js module, untested yet.
Node-Crawler – Node.js module, untested yet.
ONLINE TOOLS
new Web Scraping Language – Simple syntax to crawl the web
new Online HTTP client – Dedicated SO answer
dead CasperBox – Run CasperJS scripts online
Android TOOLS for Automation
new Mechanica Browser App
RELATED LINKS & RESOURCES
Comparison of web scraping software
new: Image analysis and comparison
Questions:
Any pure Node.js solution, or Node.js-to-PhantomJS/CasperJS module, that actually works and is documented?
Answer: Chimera seems to go in that direction; check out Chimera
Other solutions capable of easier JavaScript injection than Selenium?
Do you know any pure ruby solutions?
Answer: Check out the list created by rjk with Ruby-based solutions
Do you know any related tech or solution?
Feel free to edit this question and add content as you wish! Thank you for your contributions!
A kind of JS-based Selenium not only aims at automated frontend tests; you can also take screenshots with it. It has webdrivers for all the important browsers. Unfortunately, those webdrivers seem to need some improvement (not to say they're "buggy", in the case of Firefox).
Web Scraping without getting blocked – ScrapingBee
Updated: 01 February, 2021 · 14 min read
Pierre is a data engineer who worked in several high-growth startups before co-founding ScrapingBee. He is an expert in data processing and web scraping.
Introduction
Web scraping or crawling is the process of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.
"But you should use an API for this!"
However, not every website offers an API, and APIs don't always expose every piece of information you need. So, scraping is often the only way to extract the data you want.
There are many use cases for web scraping:
E-commerce price monitoring
News aggregation
Lead generation
SEO (search engine result page monitoring)
Bank account aggregation (Mint in the US, Bankin’ in Europe)
Individuals and researchers building datasets otherwise not available.
The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except Google – they all want to be scraped by Google).
So, when you scrape, you do not want to be recognized as a robot. There are two main ways to seem human: use human tools and emulate human behavior.
This post will guide you through all the tools websites use to block you and all the ways you can successfully overcome these obstacles.
Why Use Headless Browsing?
When you open your browser and go to a webpage, it almost always means that you ask an HTTP server for some content. One of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.
The thing is, if you just run curl www.google.com, Google has many ways to know that you are not a human (for example, by looking at the headers). Headers are small pieces of information that travel with every HTTP request that hits the servers, and one of them precisely describes the client making the request: the infamous "User-Agent" header. Just by looking at the "User-Agent" header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page is great. As an experiment, just go over here; this webpage simply displays the header information of your request.
Headers are easy to alter with cURL, and copying the User-Agent header of a legitimate browser could do the trick. In the real world, you'd need to set more than one header, but it is not difficult to artificially forge an HTTP request with cURL or any HTTP library to make it look exactly like a request made with a browser. Everybody knows this. So, to determine whether you are using a real browser, websites will check something that cURL and HTTP libraries cannot do: executing JavaScript code.
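To make that forging step concrete, spoofing the User-Agent (and another header or two) with cURL is a one-liner; the header values below are just examples:
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" -H "Accept-Language: en-US,en;q=0.9" https://www.google.com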
Do you speak Javascript?
The concept is simple, the website embeds a Javascript snippet in its webpage that, once executed, will “unlock” the webpage. If you’re using a real browser, you won’t notice the difference. If you’re not, you’ll receive an HTML page with some obscure Javascript code in it:
an actual example of such a snippet
Once again, this solution is not completely bulletproof, mainly because it is now very easy to execute JavaScript outside of a browser with Node.js. However, the web has evolved and there are other tricks to determine if you are using a real browser.
Headless Browsing
Trying to execute JavaScript snippets on the side with Node.js is difficult and not robust. And more importantly, as soon as the website has a more complicated check system or is a big single-page application, cURL and pseudo-JS execution with Node.js become useless. So the best way to look like a real browser is to actually use one.
Headless Browsers will behave like a real browser except that you will easily be able to programmatically use them. The most popular is Chrome Headless, a Chrome option that behaves like Chrome without all of the user interface wrapping it.
The easiest way to use Headless Chrome is by calling a driver that wraps all its functionality into an easy API. Selenium, Playwright, and Puppeteer are the three most famous solutions.
However, it will not be enough as websites now have tools that detect headless browsers. This arms race has been going on for a long time.
While these solutions can be easy to do on your local computer, it can be trickier to make this work at scale.
Managing lots of headless Chrome instances is one of the many problems we solve at ScrapingBee.
Tired of getting blocked while scraping the web?
Our API handles headless browsers and rotates proxies for you.
Browser Fingerprinting
Everyone, especially front-end devs, knows that every browser behaves differently. Sometimes it's about rendering CSS, sometimes JavaScript, and sometimes just internal properties. Most of these differences are well known, and it is now possible to detect whether a browser is actually who it pretends to be. In other words, the website asks: "Do all of the browser properties and behaviors match what I know about the User-Agent sent by this browser?"
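As a simplified illustration of the kind of check a website might run in the page (a sketch, not any specific vendor's detection script, which would test many more properties):

```js
// A naive client-side headless check.
function looksHeadless() {
    const ua = navigator.userAgent || '';
    return (
        navigator.webdriver === true ||            // set by automation frameworks
        /HeadlessChrome/.test(ua) ||               // default headless Chrome User-Agent
        (ua.includes('Chrome') && !window.chrome)  // Chrome UA but no window.chrome object
    );
}

if (looksHeadless()) {
    console.log('Suspicious browser detected');
}
```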
This is why there is an everlasting arms race between web scrapers who want to pass themselves off as a real browser and websites that want to distinguish headless browsers from the rest.
However, in this arms race, web scrapers tend to have a big advantage. Here is why:
Screenshot of Chrome malware alert
Most of the time, when JavaScript code tries to detect whether it's being run in headless mode, it is malware trying to evade behavioral fingerprinting. This means the JavaScript will behave nicely inside a scanning environment and badly inside real browsers. And this is why the team behind Chrome's headless mode is trying to make it indistinguishable from a real user's web browser, in order to stop malware from doing that. Web scrapers can profit from this effort.
Another thing to know is that while running 20 cURL requests in parallel is trivial, and Chrome Headless is relatively easy to use for small use cases, it can be tricky to put at scale. Because it uses lots of RAM, managing more than 20 instances of it is a challenge.
By the way, if you still want to use cURL to scrape the web, we just published a guide on how to use a proxy with cURL, check it out.
If you want to learn more about browser fingerprinting I suggest you take a look at Antoine Vastel’s blog, which is entirely dedicated to this subject.
That’s about all you need to know about how to pretend like you are using a real browser. Let’s now take a look at how to behave like a real human.
TLS Fingerprinting
What is it?
TLS stands for Transport Layer Security and is the successor of SSL which was basically what the “S” of HTTPS stood for.
This protocol ensures privacy and data integrity between two or more communicating computer applications (in our case, a web browser or a script and an HTTP server).
Similar to browser fingerprinting, the goal of TLS fingerprinting is to uniquely identify users based on the way they use TLS.
How this protocol works can be split into two big parts.
First, when the client connects to the server, a TLS handshake happens. During this handshake, many requests are sent between the two to ensure that everyone is actually who they claim to be.
Then, if the handshake has been successful, the protocol describes how the client and the server should encrypt and decrypt the data in a secure way. If you want a detailed explanation, check out this great introduction by Cloudflare.
Most of the data points used to build the fingerprint come from the TLS handshake, and if you want to see what a TLS fingerprint looks like, you can visit this awesome online database.
On this website, you can see that the most used fingerprint last week was seen 22.19% of the time (at the time of writing this article).
A TLS fingerprint
This number is very big and at least two orders of magnitude higher than the most common browser fingerprint. It actually makes sense as a TLS fingerprint is computed using way fewer parameters than a browser fingerprint.
Those parameters are, amongst others:
TLS version
Handshake version
Cipher suites supported
Extensions
If you wish to know what your TLS fingerprint is, I suggest you visit this website.
How do I change it?
Ideally, in order to increase your stealth when scraping the web, you should be changing your TLS parameters. However, this is harder than it looks.
Firstly, because there are not that many TLS fingerprints out there, simply randomizing those parameters won’t work. Your fingerprint will be so rare that it will be instantly flagged as fake.
Secondly, TLS parameters are low-level stuff that relies heavily on system dependencies, so changing them is not straightforward.
For example, the famous Python requests module doesn't support changing the TLS fingerprint out of the box. Here are a few resources for changing your TLS version and cipher suite in your favorite language:
Python with HTTPAdapter and requests
NodeJS with the TLS package
Ruby with OpenSSL
Keep in mind that most of these libraries rely on the SSL/TLS implementation of your system. OpenSSL is the most widely used, and you might need to change its version in order to completely alter your fingerprint.
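To make the Node.js route from the list above concrete, here is a minimal sketch (the host, cipher list, and versions are just examples) of pinning the TLS version and cipher suites for a single request; these options are forwarded to tls.connect() under the hood:

```js
const https = require('https');

const req = https.request({
    hostname: 'example.com',
    path: '/',
    method: 'GET',
    // TLS options passed through to tls.connect():
    minVersion: 'TLSv1.2',
    maxVersion: 'TLSv1.2',
    ciphers: 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384',
}, (res) => {
    console.log('Status:', res.statusCode);
    res.resume();
});

req.on('error', console.error);
req.end();
```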
Emulate Human Behaviour: Proxy, Captcha Solving and Request Patterns
Proxy Yourself
A human using a real browser will rarely request 20 pages per second from the same website. So if you want to request a lot of pages from the same website, you have to trick the website into thinking that all those requests come from different places in the world, i.e., different IP addresses. In other words, you need to use proxies.
Proxies are not very expensive: ~$1 per IP. However, if you need to do more than ~10k requests per day on the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxy IPs need to be constantly monitored in order to discard the ones that are not working anymore and replace them.
There are several proxy solutions on the market, here are the most used rotating proxy providers: Luminati Network, Blazing SEO and SmartProxy.
There are also a lot of free proxy lists, but I don't recommend using these because they are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. Free proxy lists are usually public, and therefore their IPs will be automatically banned by most websites. Proxy quality is important: anti-crawling services are known to maintain internal lists of proxy IPs, so any traffic coming from those IPs will also be blocked. Be careful to choose proxies with a good reputation. This is why I recommend using a paid proxy network or building your own.
Another proxy type that you could look into is mobile 3G and 4G proxies. They are helpful for scraping hard-to-scrape, mobile-first websites, like social media.
To build your own proxy you could take a look at Scrapoxy, a great open-source API that allows you to build a proxy API on top of different cloud providers. Scrapoxy creates a proxy pool by spinning up instances on various cloud providers (AWS, OVH, DigitalOcean). You can then configure your client to use the Scrapoxy URL as its main proxy, and Scrapoxy will automatically assign a proxy from the pool. Scrapoxy is easily customizable to fit your needs (rate limit, blacklist, etc.), but it can be a little tedious to put in place.
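Tying this back to the headless browser setup from earlier: with Puppeteer, routing traffic through a proxy (Scrapoxy or any other) is a matter of a launch argument plus optional credentials. The host, port, and credentials below are placeholders:

```js
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        args: ['--proxy-server=http://my-proxy-host:8888'] // placeholder proxy address
    });
    const page = await browser.newPage();

    // If the proxy requires authentication:
    await page.authenticate({ username: 'user', password: 'pass' });

    await page.goto('https://httpbin.org/ip'); // shows the IP the target site sees
    console.log(await page.evaluate(() => document.body.innerText));

    await browser.close();
})();
```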
You could also use the TOR network, aka, The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin. TOR usage makes network surveillance/traffic analysis very difficult. There are a lot of use cases for TOR usage, such as privacy, freedom of speech, journalists in a dictatorship regime, and of course, illegal activities. In the context of web scraping, TOR can hide your IP address, and change your bot’s IP address every 10 minutes. The TOR exit nodes IP addresses are public. Some websites block TOR traffic using a simple rule: if the server receives a request from one of the TOR public exit nodes, it will block it. That’s why in many cases, TOR won’t help you, compared to classic proxies. It’s worth noting that traffic through TOR is also inherently much slower because of the multiple routing.
Captchas
Sometimes proxies will not be enough. Some websites systematically ask you to confirm that you are a human with so-called CAPTCHAs. Most of the time, CAPTCHAs are only displayed to suspicious IPs, so switching proxies will work in those cases. For the other cases, you'll need to use a CAPTCHA-solving service (2Captcha and Death By Captcha come to mind).
While some CAPTCHAs can be automatically resolved with optical character recognition (OCR), the most recent ones have to be solved by hand.
Old captcha, breakable programmatically
Google ReCaptcha V2
If you use the aforementioned services, on the other side of the API call you’ll have hundreds of people resolving CAPTCHAs for as low as 20ct an hour.
But then again, even if you solve CAPTCHAs or switch proxies as soon as you see one, websites can still detect your data extraction process.
Request Pattern
Another advanced tool used by websites to detect scraping is pattern recognition. So if you plan to scrape every ID from 1 to 10,000 for a given URL pattern, try not to do it sequentially or with a constant request rate. You could, for example, maintain a set of integers going from 1 to 10,000 and randomly pick one integer from this set before scraping each product, as sketched below.
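A minimal sketch of that idea (the URL pattern, the delays, and the scrapeProduct helper are all hypothetical placeholders):

```js
// Visit product IDs 1..10000 in random order, with random delays between requests.
const ids = Array.from({ length: 10000 }, (_, i) => i + 1);

// Fisher-Yates shuffle.
for (let i = ids.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [ids[i], ids[j]] = [ids[j], ids[i]];
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
    for (const id of ids) {
        await scrapeProduct(`https://example.com/product/${id}`); // hypothetical helper
        await sleep(1000 + Math.random() * 4000); // 1-5 seconds of jitter
    }
})();
```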
Some websites also keep statistics on browser fingerprints per endpoint. This means that if you don't change some parameters in your headless browser and target a single endpoint, they might block you anyway.
Websites also tend to monitor the origin of traffic, so if you want to scrape a website in Brazil, try not to do it with proxies in Vietnam.
But from experience, I can tell you that rate is the most important factor in “Request Pattern Recognition”, so the slower you scrape, the less chance you have of being discovered.
Emulate Machine Behaviour: Reverse engineering of API
Sometimes, the server expects the client to be a machine. In these cases, hiding yourself is much easier.
Reverse engineering of API
Basically, this “trick” comes down to two things:
Analyzing a web page behaviour to find interesting API calls
Forging those API calls with your code
For example, let’s say that I want to get all the comments of a famous social network. I notice that when I click on the “load more comments” button, this happens in my inspector:
Request being made when clicking more comments
Notice that we filter out every request except "XHR" ones to avoid noise.
When we try to see which request is being made and which response we get… bingo!
Request response
Now, if we look at the "Headers" tab, we should have everything we need to replay this request and understand the value of each parameter. This will allow us to make this request from a simple HTTP client.
HTTP Client response
The hardest part of this process is understanding the role of each parameter in the request. Know that you can right-click on any request in the Chrome DevTools inspector, export it in HAR format, and then import it into your favorite HTTP client (I love Paw and Postman).
This will allow you to have all the parameters of a working request laid out and will make your experimentation much faster and fun.
Previous request imported in Paw
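Once the parameters are understood, replaying the call from code is straightforward. Here is a hedged sketch in Node.js (the host, endpoint, parameters, and header values are placeholders standing in for whatever the HAR export revealed):

```js
const https = require('https');

// Replay the "load more comments" XHR with the same headers the browser sent.
const req = https.request({
    hostname: 'social-network.example.com',        // placeholder host
    path: '/api/comments?postId=123&cursor=abc',   // placeholder endpoint and parameters
    method: 'GET',
    headers: {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/90.0 Safari/537.36',
        'Accept': 'application/json',
        'X-Requested-With': 'XMLHttpRequest',
    },
}, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => console.log(JSON.parse(body)));
});

req.on('error', console.error);
req.end();
```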
Reverse-Engineering of Mobile Apps
The same principles apply when it comes to reverse engineering a mobile app. You will want to intercept the requests your mobile app makes to the server and replay them with your code.
Doing this is hard for two reasons:
To intercept requests, you will need a Man In The Middle proxy. (Charles proxy for example)
Mobile apps can fingerprint your requests and obfuscate them more easily than a web app
For example, when Pokemon Go was released a few years ago, tons of people cheated the game after reverse-engineering the requests the mobile app made.
What they did not know was that the mobile app was sending a "secret" parameter that was not sent by the cheating script. It was easy for Niantic to identify the cheaters, and a few weeks later, a massive number of players were banned for cheating.
Also, here is an interesting example about someone who reverse-engineered the Starbucks API.
Conclusion
Here is a recap of all the anti-bot techniques we saw in this article, the counter measure for each, and whether ScrapingBee supports that counter measure:

Anti-bot technique | Counter measure | Supported by ScrapingBee
Browser fingerprinting | Headless browsers | ✅
IP rate limiting | Rotating proxies | ✅
Banning data center IPs | Residential IPs | ✅
TLS fingerprinting | Forge and rotate TLS fingerprints | ✅
Captchas on suspicious activity | All of the above | ✅
Systematic Captchas | Captcha-solving tools and services | ❌
I hope that this overview will help you understand web-scraping and that you learned a lot reading this article.
We leverage everything I talked about in this post at ScrapingBee. Our web scraping API handles thousands of requests per second without ever being blocked. If you don't want to lose too much time setting everything up, make sure to try ScrapingBee. The first 1k API calls are on us :).
We recently published a guide about the best web scraping tools on the market, don’t hesitate to take a look!
Frequently Asked Questions about headless browser web scraping
What is headless web scraping?
A headless browser is a web browser with no user interface (UI) whatsoever. Instead, it follows instructions defined by software developers in different programming languages. Headless browsers are mostly used for running automated quality assurance tests, or to scrape websites.
Is headless scraping faster?
Headless browsers are faster than real browsers: you will typically see 2x to 15x faster performance when using a headless browser.
Can Web scraping be detected?
Websites can easily detect scrapers when they encounter repetitive and similar browsing behavior. Therefore, you need to apply different scraping patterns from time to time while extracting data from the sites.