Scrape Website Without Getting Blocked
5 Tips For Web Scraping Without Getting Blocked or Blacklisted
Web scraping can be difficult, particularly when most popular sites actively try to prevent developers from scraping their websites using a variety of techniques such as IP address detection, HTTP request header checking, CAPTCHAs, JavaScript checks, and more. On the other hand, there are many analogous strategies that developers can use to avoid these blocks, allowing them to build web scrapers that are very hard to detect. Here are a few quick tips on how to crawl a website without getting blocked:
1. IP Rotation
The number one way sites detect web scrapers is by examining their IP address, so much of scraping without getting blocked comes down to using a number of different IP addresses so that no single address gets banned. To avoid sending all of your requests through the same IP address, you can use an IP rotation service like Scraper API or other proxy services to route your requests through a series of different IP addresses. This will allow you to scrape the majority of websites without issue.
For sites using more advanced proxy blacklists, you may need to try residential or mobile proxies. If you are not familiar with what this means, you can check out our article on different types of proxies here. Ultimately, the number of IP addresses in the world is fixed, and the vast majority of people surfing the internet only get one (the IP address given to them by their internet service provider for their home internet). Having, say, 1 million IP addresses will therefore allow you to surf as much as 1 million ordinary internet users without arousing suspicion. IP blocking is by far the most common way that sites block web crawlers, so if you are getting blocked, getting more IP addresses is the first thing you should try.
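As a rough illustration of the rotation idea above, the sketch below cycles through a small proxy pool using only the Python standard library. The proxy addresses are placeholders from a documentation IP range, and `ProxyRotator` is a hypothetical helper name, not part of any service mentioned in this article.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy pool must not be empty")
        self._pool = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._pool)

    def build_opener(self):
        # Each opener routes HTTP and HTTPS traffic through the next proxy.
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)


# Placeholder addresses (assumption); replace with proxies you may use.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

rotator = ProxyRotator(PROXIES)
# Typical use, one opener per request:
# opener = rotator.build_opener()
# html = opener.open("https://example.com", timeout=10).read()
```

Round-robin keeps the load even across the pool; picking a proxy at random would work just as well for small pools.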
2. Set a Real User Agent
User Agents are a special type of HTTP header that will tell the website you are visiting exactly what browser you are using. Some websites will examine User Agents and block requests from User Agents that don’t belong to a major browser. Most web scrapers don’t bother setting the User Agent, and are therefore easily detected by checking for missing User Agents. Don’t be one of these developers! Remember to set a popular User Agent for your web crawler (you can find a list of popular User Agents here). For advanced users, you can also set your User Agent to the Googlebot User Agent, since most websites want to be listed on Google and therefore let Googlebot through. It’s important to keep the User Agents you use relatively up to date. Every new update to Google Chrome, Safari, Firefox, etc. has a different user agent, so if you go years without changing the user agent on your crawlers, they will become more and more suspicious. It may also be smart to rotate between a number of different user agents so that there isn’t a sudden spike in requests from one exact user agent to a site (this would also be fairly easy to detect).
3. Set Other Request Headers
Real web browsers will have a whole host of headers set, any of which can be checked by careful websites to block your web scraper. In order to make your scraper appear to be a real browser, you can open a page that displays the request headers your browser sends and simply copy the headers that you see there (they are the headers that your current web browser is using). Setting headers like “Accept”, “Accept-Encoding”, “Accept-Language”, and “Upgrade-Insecure-Requests” will make your requests look like they are coming from a real browser, so your scraper is less likely to be blocked. For example, the headers from a recent version of Google Chrome are:
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip",
"Accept-Language": "en-US,en;q=0.9,es;q=0.8",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
By rotating through a series of IP addresses and setting proper HTTP request headers (especially User Agents), you should be able to avoid being detected by 99% of websites.
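To make tips 2 and 3 concrete, here is a minimal sketch that combines browser-like headers with a rotating User Agent. The second User-Agent string and the helper name `build_headers` are illustrative assumptions; real lists should be refreshed from current browser releases.

```python
import random

# Example User-Agent strings (illustrative; they go stale quickly).
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) "
    "Gecko/20100101 Firefox/69.0",
]

# Headers shared by every request, mirroring the example above.
BASE_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}


def build_headers(rng=random):
    """Return a fresh header dict with a randomly chosen User-Agent."""
    headers = dict(BASE_HEADERS)
    headers["User-Agent"] = rng.choice(USER_AGENTS)
    return headers


# Typical use with the standard library (the body arrives gzip-compressed
# if the server honors Accept-Encoding, so it must be decompressed):
# import urllib.request
# req = urllib.request.Request("https://example.com", headers=build_headers())
```

Copying `BASE_HEADERS` on every call keeps the shared dictionary unchanged, so each request gets its own independent set of headers.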
4. Set Random Intervals In Between Your Requests
It is easy to detect a web scraper that sends exactly one request each second, 24 hours a day! No real person would ever use a website like that, and an obvious pattern like this is easily detectable. Use randomized delays (anywhere between 2 and 10 seconds, for example) in order to build a web scraper that can avoid being blocked. Also, remember to be polite: if you send requests too fast, you can crash the website for everyone. If you detect that responses are getting slower and slower, you may want to send requests more slowly so you don’t overload the web server (you’ll definitely want to do this to help frameworks like Scrapy avoid being banned).
For especially polite crawlers, you can check a site’s robots.txt file (conventionally served from the root of the site). Often it will have a crawl-delay line that tells you how many seconds you should wait between requests so that you don’t cause any issues with heavy server traffic.
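The two ideas in this tip, randomized delays and respecting crawl-delay, can be sketched with the Python standard library as follows. The robots.txt content is a made-up sample, and `polite_delay` and `crawl_delay_from_robots` are hypothetical helper names.

```python
import random
import urllib.robotparser


def crawl_delay_from_robots(robots_lines, user_agent="*"):
    """Return the crawl-delay (seconds) declared for user_agent, or None."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.crawl_delay(user_agent)


def polite_delay(crawl_delay=None, low=2.0, high=10.0, rng=random):
    """Random wait in [low, high] seconds, never below the crawl-delay."""
    delay = rng.uniform(low, high)
    if crawl_delay is not None:
        delay = max(delay, float(crawl_delay))
    return delay


# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
""".splitlines()

site_delay = crawl_delay_from_robots(SAMPLE_ROBOTS)
# Between requests: time.sleep(polite_delay(site_delay))
```

In a real crawler you would fetch the file once per site (for example with `RobotFileParser.set_url` and `read`) and reuse the parsed delay for every request to that site.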
5. Set a Referrer
The Referer header is an HTTP request header that lets the site know what site you are arriving from. Generally it’s a good idea to set this so that it looks like you’re arriving from Google. You can do this with the header:
“Referer”: “
You can also change this up for websites in different countries, for example if you are trying to scrape a site in the UK, you might want to use “ instead of “. You can also look up the most common referers to any site using a tool like, often this will be a social media site like Youtube or some social media sites. By setting this header, it makes your request look even more authentic, as it appears to be traffic from a site that the webmaster would be expecting a lot of traffic to come from during normal usage.
For more advanced users scraping particularly difficult to scrape sites, we’ve added these 5 advanced web scraping tips.
6. Use a Headless Browser
The trickiest websites to scrape may detect subtle tells like web fonts, extensions, browser cookies, and JavaScript execution in order to determine whether or not the request is coming from a real user. In order to scrape these websites you may need to deploy your own headless browser (or have Scraper API do it for you!).
Tools like Selenium and Puppeteer allow you to write a program that controls a real web browser, identical to what a real user would use, which makes detection much harder. While it takes quite a bit of work to make Selenium or Puppeteer undetectable, this is the most effective way to scrape websites that would otherwise give you quite some difficulty. Note that you should only use these tools for web scraping if absolutely necessary: these programmatically controllable browsers are extremely CPU and memory intensive and can sometimes crash. There is no need to use these tools for the vast majority of sites (where a simple GET request will do), so only reach for them if you are getting blocked for not using a real browser!
7. Avoid Honeypot Traps
A lot of sites will try to detect web crawlers by putting in invisible links that only a robot would follow. You need to detect whether a link has the “display: none” or “visibility: hidden” CSS properties set, and if it does, avoid following that link. Otherwise a site will be able to correctly identify you as a programmatic scraper, fingerprint the properties of your requests, and block you quite easily. Honeypots are one of the easiest ways for smart webmasters to detect crawlers, so make sure that you are performing this check on each page that you scrape. Advanced webmasters may also just set the color to white (or to whatever the background color of the page is), so you may want to check if the link has something like “color: #fff;” or “color: #ffffff” set, as this would make the link effectively invisible as well.
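The check described above could look roughly like the following sketch. It only inspects inline `style` attributes (an assumption; rules in external stylesheets would need a real browser or CSS engine), it treats white text as hidden on the assumption that the page background is white, and `split_links` is a hypothetical helper name.

```python
import re
from html.parser import HTMLParser

# Inline-style patterns that make a link invisible. The lookbehind keeps
# "background-color: #fff" from being mistaken for "color: #fff".
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|(?<![-\w])color\s*:\s*#fff(?:fff)?\b",
    re.IGNORECASE,
)


class LinkCollector(HTMLParser):
    """Sort <a href> targets into visible links and suspected honeypots."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self.suspect_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        style = attributes.get("style") or ""
        if HIDDEN_STYLE.search(style):
            self.suspect_links.append(href)
        else:
            self.visible_links.append(href)


def split_links(html):
    """Return (visible_links, suspect_links) found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.visible_links, collector.suspect_links
```

A crawler would then follow only the first list and log the second, since a sudden increase in suspect links can itself signal that a site has added bot traps.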
8. Detect Website Changes
Many websites change layouts for many reasons and this will often cause scrapers to break. In addition, some websites will have different layouts in unexpected places (page 1 of the search results may have a different layout than page 4). This is true even for surprisingly large companies that are less tech savvy, e.g. large retail stores that are just making the transition online. You need to properly detect these changes when building your scraper, and create ongoing monitoring so that you know your crawler is still working (usually just counting the number of successful requests per crawl should do the trick).
Another easy way to set up monitoring is to write a unit test for a specific URL on the site (or one URL of each type, for example on a reviews site you may want to write a unit test for the search results page, another unit test for the reviews page, another unit test for the main product page, etc.). This way you can check for breaking site changes using only a few requests every 24 hours or so without having to go through a full crawl to detect errors.
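One way such a layout check might be written is sketched below: it looks for the CSS classes your parser depends on and reports any that have disappeared. The class names and the helper `missing_markers` are hypothetical; in practice you would pick markers from the real pages you scrape.

```python
from html.parser import HTMLParser


class _ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_markers(html, expected_classes):
    """Return the expected class names that are absent from the page."""
    collector = _ClassCollector()
    collector.feed(html)
    return sorted(set(expected_classes) - collector.classes)


# Hypothetical markers for a reviews page.
REVIEW_PAGE_MARKERS = ["review-title", "review-body", "review-rating"]

sample_page = """
<div class="review card">
  <h2 class="review-title">Great</h2>
  <p class="review-body">Works well.</p>
</div>
"""
# An empty result means the layout still matches what the scraper expects.
problems = missing_markers(sample_page, REVIEW_PAGE_MARKERS)
```

Wrapped in a daily unit test per page type, a non-empty result is an early warning that the site changed before a full crawl silently returns bad data.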
9. Use a CAPTCHA Solving Service
One of the most common ways for sites to crack down on crawlers is to display a CAPTCHA. Luckily, there are services specifically designed to get past these restrictions in an economical way, whether they are fully integrated solutions like Scraper API or narrow CAPTCHA solving solutions that you can integrate just for the CAPTCHA solving functionality like 2Captcha or AntiCAPTCHA. For sites that resort to CAPTCHAs, it may be necessary to use one of these solutions. Note that some of these CAPTCHA solving services are fairly slow and expensive, so you may need to consider whether it is still economically viable to scrape sites that require continuous CAPTCHA solving over time.
10. Scrape Out of the Google Cache
As a true last resort, particularly for data that does not change too often, you may be able to scrape data out of Google’s cached copy of the website rather than the website itself. Simply prepend “ to the beginning of the URL (for example to scrape Scraper API’s documentation you could scrape “.
This is a good workaround for non-time-sensitive information that is on extremely hard to scrape sites. While scraping out of Google’s cache can be a bit more reliable than scraping a site that is actively trying to block your scrapers, remember that this is not a foolproof solution. For example, some sites like LinkedIn actively tell Google not to cache their data, and the data for unpopular sites may be fairly out of date, as Google determines how often it should crawl a site based on the site’s popularity as well as how many pages exist on that site.
Hopefully you’ve learned a few useful tips for scraping popular websites without being blacklisted or IP banned. While just setting up IP rotation and proper HTTP request headers should be more than enough in most cases, sometimes you will have to resort to more advanced techniques like using a headless browser or scraping out of the Google cache to get the data you need.
As always, it’s important to be respectful to webmasters and other users of the site when scraping, so if you detect that the site is slowing down you need to slow down your request rate. This is especially important when scraping smaller sites that may not have the resources that large enterprises may have for web hosting.
If you have a web scraping job you’d like to talk to us about helping your web scraper avoid detection please fill out this form and we’ll get back to you within 24 hours. Happy scraping!
How to Scrape Websites Without Being Blocked in 5 Mins?
Web scraping is a technique often employed for automating human browsing behavior for the purpose of retrieving large amounts of data from the web pages efficiently.
While various web scraping tools, like Octoparse, are becoming popular and benefit people substantially in all fields, they come at a price for website owners. A straightforward example is when web scraping overloads a web server and leads to a server breakdown. More and more website owners have equipped their sites with all kinds of anti-scraping techniques to block scrapers, which makes web scraping more difficult. Nevertheless, there are still ways to fight against blocking.
How to scrape without being blocked?
In this article, we will talk about 5 tips you can follow to scrape without being blacklisted or blocked.
1. Slow down the scraping
Most web scraping activities aim to fetch data as quickly as possible. However, when a human visits a site, the browsing is going to be much slower compared to what happens with web scraping. Therefore, it is really easy for a site to catch you as a scraper by tracking your access speed. Once it finds you are going through the pages too fast, it will suspect that you are not a human and block you naturally.
So please do not overload the site. You can put a random time delay between requests and reduce concurrent page access to 1-2 pages at a time. Treat the website nicely, and you will be able to keep scraping it.
In Octoparse, users can set up a wait time for any steps in the workflow to control the scraping speed. There is even a “random” option to make the scraping more human-like.
2. Use proxy servers
When a site detects there are a number of requests from a single IP address, it will easily block the IP address. To avoid sending all of your requests through the same IP address, you can use proxy servers. A proxy server is a server (a computer system or an application) that acts as an intermediary for requests from clients seeking resources from other servers (from Wikipedia: Proxy server). It allows you to send requests to websites using the IP you set up, masking your real IP address.
Of course, if you use a single IP set up in the proxy server, it is still easy to get blocked. You need to create a pool of IP addresses and use them randomly to route your requests through a series of different IP addresses.
Many services, such as VPNs, can help you obtain rotating IPs. Octoparse Cloud Service is supported by hundreds of cloud servers, each with a unique IP address. When an extraction task is set to execute in the Cloud, requests are performed on the target website through various IPs, minimizing the chances of being traced. Octoparse local extraction allows users to set up proxies to avoid being blocked.
3. Apply different scraping patterns
Humans browse a site with random clicks or view time; however, web scraping always follows the same crawling pattern as programmed bots follow a specific logic. So anti-scraping mechanisms can easily detect the crawler by identifying the repetitive scraping behaviors performed on a website.
You will need to change your scraping pattern from time to time and incorporate random clicks, mouse movements, or waiting time to make web scraping more human.
In Octoparse, you can easily set up a workflow in 3-5 minutes. You can add clicks and mouse movements easily by dragging and pointing, or even rebuild a workflow quickly, saving lots of coding time for programmers and helping non-coders make their own scrapers easily.
4. Switch user-agents
A user-agent (UA) is a string in the header of a request, identifying the browser and operating system to the web server. Every request made by a web browser contains a user-agent. Using the same user-agent for an abnormally large number of requests will get you blocked.
To get past the block, you should switch user-agents frequently instead of sticking to one.
Many programmers add fake user-agent in the header or manually make a list of user-agents to avoid being blocked. With Octoparse, you can easily enable automatic UA rotation in your crawler to reduce the risk of being blocked.
5. Be careful of honeypot traps
Honeypots are links that are invisible to normal visitors but are there in the HTML code and can be found by web scrapers. They are just like traps to detect scrapers by directing them to blank pages. Once a particular visitor browses a honeypot page, the website can be relatively sure it is not a human visitor and starts throttling or blocking all requests from that client.
When building a scraper for a particular site, it is worth looking carefully to check whether there are any links hidden to users using a standard browser.
Octoparse uses XPath for precise capturing or clicking actions, avoiding clicking the faked links (see how to use XPath to locate elements here).
All the tips provided in this article can help you avoid getting blocked to some extent. Still, for every foot that web scraping tech advances, anti-scraping tech advances ten. Share your ideas with us if you feel anything can be added to the list.
Some e-commerce websites, such as Amazon and eBay, have severe blocking mechanisms, which you may find difficult to get past even after applying the rules above. Don’t worry, Octoparse data service can offer you the solution you want.
We work closely with you to understand your data requirement and make sure we deliver what you desire. Talk to Octoparse data expert now to discuss how web scraping services can help you maximize efforts.
Web Scraping without getting blocked – ScrapingBee
Updated: 01 February, 2021
Pierre is a data engineer who worked in several high-growth startups before co-founding ScrapingBee. He is an expert in data processing and web scraping.
Introduction
Web scraping or crawling is the process of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.
“But you should use an API for this!”
However, not every website offers an API, and APIs don’t always expose every piece of information you need. So, scraping is often the only way to extract website data.
There are many use cases for web scraping:
E-commerce price monitoring
News aggregation
Lead generation
SEO (search engine result page monitoring)
Bank account aggregation (Mint in the US, Bankin’ in Europe)
Individuals and researchers building datasets otherwise not available.
The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except Google – they all want to be scraped by Google).
So, when you scrape, you do not want to be recognized as a robot. There are two main ways to seem human: use human tools and emulate human behavior.
This post will guide you through all the tools websites use to block you and all the ways you can successfully overcome these obstacles.
Why Use Headless Browsing?
When you open your browser and go to a webpage, it almost always means that you ask an HTTP server for some content. One of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.
The thing is, if you just do: curl, Google has many ways to know that you are not a human (for example by looking at the headers). Headers are small pieces of information that go with every HTTP request that hits the servers. One of those pieces of information precisely describes the client making the request: this is the infamous “User-Agent” header. Just by looking at the “User-Agent” header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page is great. As an experiment, just go over here. This webpage simply displays the headers information of your request.
Headers are easy to alter with cURL, and copying the User-Agent header of a legit browser could do the trick. In the real world, you’d need to set more than one header. But it is not difficult to artificially forge an HTTP request with cURL or any library to make the request look exactly like a request made with a browser. Everybody knows this. So, to determine if you are using a real browser, websites will check something that cURL and HTTP libraries cannot do: execute Javascript code.
Do you speak Javascript?
The concept is simple, the website embeds a Javascript snippet in its webpage that, once executed, will “unlock” the webpage. If you’re using a real browser, you won’t notice the difference. If you’re not, you’ll receive an HTML page with some obscure Javascript code in it:
an actual example of such a snippet
Once again, this solution is not completely bulletproof, mainly because it is now very easy to execute Javascript outside of a browser. However, the web has evolved and there are other tricks to determine if you are using a real browser.
Headless Browsing
Trying to execute Javascript snippets on the side is difficult and not robust. And more importantly, as soon as the website has a more complicated check system or is a big single-page application, cURL and pseudo-JS execution become useless. So the best way to look like a real browser is to actually use one.
Headless Browsers will behave like a real browser except that you will easily be able to programmatically use them. The most popular is Chrome Headless, a Chrome option that behaves like Chrome without all of the user interface wrapping it.
The easiest way to use Headless Chrome is by calling a driver that wraps all functionality into an easy API. Selenium, Playwright and Puppeteer are the three most famous solutions.
However, it will not be enough as websites now have tools that detect headless browsers. This arms race has been going on for a long time.
While these solutions can be easy to do on your local computer, it can be trickier to make this work at scale.
Managing lots of Chrome headless instances is one of the many problems we solve at ScrapingBee
Browser Fingerprinting
Everyone, especially front-end devs, knows that every browser behaves differently. Sometimes it’s about rendering CSS, sometimes Javascript, and sometimes just internal properties. Most of these differences are well-known and it is now possible to detect if a browser is actually who it pretends to be. This means the website asks “do all of the browser properties and behaviors match what I know about the User-Agent sent by this browser?”.
This is why there is an everlasting arms race between web scrapers who want to pass themselves off as a real browser and websites who want to distinguish headless from the rest.
However, in this arms race, web scrapers tend to have a big advantage, and here is why:
Screenshot of Chrome malware alert
Most of the time, when Javascript code tries to detect whether it’s being run in headless mode, it is malware trying to evade behavioral fingerprinting. This means that the Javascript will behave nicely inside a scanning environment and badly inside real browsers. And this is why the team behind the Chrome headless mode is trying to make it indistinguishable from a real user’s web browser in order to stop malware from doing that. Web scrapers can profit from this effort.
Another thing to know is that while running 20 cURL requests in parallel is trivial and Chrome Headless is relatively easy to use for small use cases, it can be tricky to run at scale. Because it uses lots of RAM, managing more than 20 instances of it is a challenge.
By the way, if you still want to use cURL to scrape the web, we just published a guide on how to use a proxy with cURL, check it out.
If you want to learn more about browser fingerprinting I suggest you take a look at Antoine Vastel’s blog, which is entirely dedicated to this subject.
That’s about all you need to know about how to pretend like you are using a real browser. Let’s now take a look at how to behave like a real human.
TLS Fingerprinting
What is it?
TLS stands for Transport Layer Security and is the successor of SSL which was basically what the “S” of HTTPS stood for.
This protocol ensures privacy and data integrity between two or more communicating computer applications (in our case, a web browser or a script and an HTTP server).
Similar to browser fingerprinting the goal of TLS fingerprinting is to uniquely identify users based on the way they use TLS.
How this protocol works can be split into two big parts.
First, when the client connects to the server, a TLS handshake happens. During this handshake, several messages are exchanged between the two to ensure that everyone is actually who they claim to be.
Then, if the handshake has been successful the protocol describes how the client and the server should encrypt and decrypt the data in a secure way. If you want a detailed explanation, check out this great introduction by Cloudflare.
Most of the data points used to build the fingerprint come from the TLS handshake, and if you want to see what a TLS fingerprint looks like, you can visit this awesome online database.
On this website, you can see that the most used fingerprint last week was used 22.19% of the time (at the time of writing this article).
A TLS fingerprint
This number is very big and at least two orders of magnitude higher than the most common browser fingerprint. It actually makes sense as a TLS fingerprint is computed using way fewer parameters than a browser fingerprint.
Those parameters are, amongst others:
TLS version
Handshake version
Cipher suites supported
Extensions
If you wish to know what your TLS fingerprint is, I suggest you visit this website.
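To show how a handful of handshake parameters can collapse into a single fingerprint, here is a sketch in the style of the publicly documented JA3 method. Using JA3 is an assumption on my part, since this article does not say which scheme the database above relies on, and the numeric values below are illustrative rather than captured from a real client.

```python
import hashlib


def ja3_style_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Join handshake parameters JA3-style and hash them with MD5.

    Fields are separated by commas; values inside a field by dashes.
    Returns (raw_string, hex_digest).
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    raw = ",".join(fields)
    return raw, hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values: 771 is the decimal code for TLS 1.2.
raw, digest = ja3_style_fingerprint(
    tls_version=771,
    ciphers=[4865, 4866, 4867],
    extensions=[0, 11, 10],
    curves=[29, 23, 24],
    point_formats=[0],
)
```

Because order matters in the joined string, two clients offering the same cipher suites in a different order produce different fingerprints, which is part of why HTTP libraries are easy to tell apart from browsers.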
How do I change it?
Ideally, in order to increase your stealth when scraping the web, you should be changing your TLS parameters. However, this is harder than it looks.
Firstly, because there are not that many TLS fingerprints out there, simply randomizing those parameters won’t work. Your fingerprint will be so rare that it will be instantly flagged as fake.
Secondly, TLS parameters are low-level settings that rely heavily on system dependencies. So, changing them is not straightforward.
For example, the famous Python requests module doesn’t support changing the TLS fingerprint out of the box. Here are a few resources to change your TLS version and cipher suite in your favorite language:
Python with HTTPAdapter and requests
NodeJS with the TLS package
Ruby with OpenSSL
Keep in mind that most of these libraries rely on the SSL and TLS implementation of your system (OpenSSL is the most widely used), and you might need to change its version in order to completely alter your fingerprint.
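As a small illustration using only the Python standard library (rather than the requests-based approach listed above), the sketch below pins the minimum TLS version and narrows the offered cipher suites. The cipher string is just an example choice, and `build_tls_context` is a hypothetical helper name. This changes only part of the fingerprint; details such as extension order still come from the underlying OpenSSL build.

```python
import ssl
import urllib.request


def build_tls_context(ciphers="ECDHE+AESGCM", minimum=ssl.TLSVersion.TLSv1_2):
    """Create a client context with a custom cipher list and TLS floor."""
    context = ssl.create_default_context()
    context.minimum_version = minimum
    # set_ciphers affects TLS 1.2 and below; TLS 1.3 suites are managed
    # separately by OpenSSL and stay enabled.
    context.set_ciphers(ciphers)
    return context


context = build_tls_context()
opener = urllib.request.build_opener(
    urllib.request.HTTPSHandler(context=context)
)
# html = opener.open("https://example.com", timeout=10).read()
```

Certificate verification stays on because the context comes from `ssl.create_default_context()`, so only the handshake parameters change, not the security of the connection.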
Emulate Human Behaviour: Proxy, Captcha Solving and Request Patterns
Proxy Yourself
A human using a real browser will rarely request 20 pages per second from the same website. So if you want to request a lot of pages from the same website, you have to trick the website into thinking that all those requests come from different places in the world, i.e. different IP addresses. In other words, you need to use proxies.
Proxies are not very expensive: about $1 per IP. However, if you need to do more than ~10k requests per day on the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxy IPs need to be constantly monitored in order to discard the ones that are no longer working and replace them.
There are several proxy solutions on the market, here are the most used rotating proxy providers: Luminati Network, Blazing SEO and SmartProxy.
There are also a lot of free proxy lists, and I don’t recommend using them: they are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. Free proxy lists are usually public, so their IPs will be automatically banned by most websites. Proxy quality is important. Anti-crawling services are known to maintain internal lists of proxy IPs, so any traffic coming from those IPs will also be blocked. Be careful to choose a provider with a good reputation. This is why I recommend using a paid proxy network or building your own.
Another proxy type that you could look into is mobile, 3G and 4G proxies. This is helpful for scraping hard-to-scrape mobile-first websites, like social media.
To build your own proxy you could take a look at scrapoxy, a great open-source API that allows you to build a proxy API on top of different cloud providers. Scrapoxy will create a proxy pool by creating instances on various cloud providers (AWS, OVH, Digital Ocean). Then, you will be able to configure your client so it uses the Scrapoxy URL as the main proxy, and Scrapoxy will automatically assign a proxy from the proxy pool. Scrapoxy is easily customizable to fit your needs (rate limit, blacklist, ...), but it can be a little tedious to put in place.
You could also use the TOR network, aka, The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin. TOR usage makes network surveillance/traffic analysis very difficult. There are a lot of use cases for TOR usage, such as privacy, freedom of speech, journalists in a dictatorship regime, and of course, illegal activities. In the context of web scraping, TOR can hide your IP address, and change your bot’s IP address every 10 minutes. The TOR exit nodes IP addresses are public. Some websites block TOR traffic using a simple rule: if the server receives a request from one of the TOR public exit nodes, it will block it. That’s why in many cases, TOR won’t help you, compared to classic proxies. It’s worth noting that traffic through TOR is also inherently much slower because of the multiple routing.
Captchas
Sometimes proxies will not be enough. Some websites systematically ask you to confirm that you are a human with so-called CAPTCHAs. Most of the time CAPTCHAs are only displayed to suspicious IPs, so switching proxies will work in those cases. For the other cases, you’ll need to use a CAPTCHA solving service (2Captchas and DeathByCaptchas come to mind).
While some CAPTCHAs can be automatically solved with optical character recognition (OCR), the most recent ones have to be solved by hand.
Old captcha, breakable programmatically
Google ReCaptcha V2
If you use the aforementioned services, on the other side of the API call you’ll have hundreds of people resolving CAPTCHAs for as low as 20ct an hour.
But then again, even if you solve CAPTCHAs or switch proxies as soon as you see one, websites can still detect your data extraction process.
Request Pattern
Another advanced tool used by websites to detect scraping is pattern recognition. So if you plan to scrape every ID from 1 to 10 000 for the URL, try not to do it sequentially or at a constant request rate. You could, for example, maintain a set of integers going from 1 to 10 000, randomly choose one integer from this set, and then scrape the corresponding product.
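The random-order idea from the paragraph above can be sketched in a few lines. `shuffled_ids` is a hypothetical helper, and the optional seed exists only to make runs repeatable while debugging.

```python
import random


def shuffled_ids(first=1, last=10_000, seed=None):
    """Return every ID in [first, last] exactly once, in random order."""
    rng = random.Random(seed)
    ids = list(range(first, last + 1))
    rng.shuffle(ids)
    return ids


# Each ID is visited once, but never in a predictable sequence.
# for product_id in shuffled_ids():
#     scrape(product_id)   # scrape() stands in for your own fetch logic
```

Shuffling the full list up front guarantees each ID is scraped exactly once, which repeated random draws from the range would not.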
Some websites also compute statistics on browser fingerprints per endpoint. This means that if you don’t change some parameters in your headless browser and target a single endpoint, they might block you.
Websites also tend to monitor the origin of traffic, so if you want to scrape a website in Brazil, try not to do it with proxies in Vietnam.
But from experience, I can tell you that rate is the most important factor in “Request Pattern Recognition”, so the slower you scrape, the less chance you have of being discovered.
Emulate Machine Behaviour: Reverse engineering of API
Sometimes, the server expects the client to be a machine. In these cases, hiding yourself is way easier.
Reverse engineering of API
Basically, this “trick” comes down to two things:
Analyzing a web page behaviour to find interesting API calls
Forging those API calls with your code
For example, let’s say that I want to get all the comments of a famous social network. I notice that when I click on the “load more comments” button, this happens in my inspector:
Request being made when clicking more comments
Notice that we filter out every request except “XHR” ones to avoid noise.
When we look at which request is being made and which response we get… – bingo!
Request response
Now if we look at the “Headers” tab we should have everything we need to replay this request and understand the value of each parameter. This will allow us to make this request from a simple HTTP client.
HTTP Client response
The hardest part of this process is to understand the role of each parameter in the request. Know that you can right-click on any request in the Chrome DevTools inspector, export it in HAR format and then import it in your favorite HTTP client (I love Paw and Postman).
This will allow you to have all the parameters of a working request laid out and will make your experimentation much faster and fun.
Previous request imported in Paw
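If you would rather replay the exported request from code than from a GUI client, a sketch along these lines may help. It relies on the standard HAR layout (`log.entries[].request` with `method`, `url`, `headers` and optional `postData.text`); the sample entry, its URL and its parameters are invented for illustration, and `request_from_har_entry` is a hypothetical helper name.

```python
import json
import urllib.request

# Headers that the HTTP client should compute itself.
SKIPPED_HEADERS = {"content-length", "host", "connection"}


def request_from_har_entry(entry):
    """Build a urllib Request that mirrors one entry of a HAR export."""
    recorded = entry["request"]
    headers = {
        h["name"]: h["value"]
        for h in recorded.get("headers", [])
        # HTTP/2 pseudo-headers such as ":authority" cannot be sent as-is.
        if not h["name"].startswith(":")
        and h["name"].lower() not in SKIPPED_HEADERS
    }
    body = recorded.get("postData", {}).get("text")
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(
        recorded["url"], data=data, headers=headers, method=recorded["method"]
    )


# Invented entry standing in for har["log"]["entries"][i].
SAMPLE_ENTRY = {
    "request": {
        "method": "POST",
        "url": "https://example.com/api/comments",
        "headers": [
            {"name": ":authority", "value": "example.com"},
            {"name": "Content-Type", "value": "application/json"},
            {"name": "Content-Length", "value": "30"},
        ],
        "postData": {"text": json.dumps({"post_id": 42, "offset": 20})},
    }
}

request = request_from_har_entry(SAMPLE_ENTRY)
# response = urllib.request.urlopen(request, timeout=10)
```

From there you can change one parameter at a time (for example the invented `offset` value) and compare responses to work out what each parameter does.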
Reverse-Engineering of Mobile Apps
The same principles apply when it comes to reverse engineering a mobile app. You will want to intercept the requests your mobile app makes to the server and replay them with your code.
Doing this is hard for two reasons:
To intercept requests, you will need a Man In The Middle proxy. (Charles proxy for example)
Mobile Apps can fingerprint your request and obfuscate them more easily than a web app
For example, when Pokemon Go was released a few years ago, tons of people cheated the game after reverse-engineering the requests the mobile app made.
What they did not know was that the mobile app was sending a “secret” parameter that was not sent by the cheating script. It was then easy for Niantic to identify the cheaters. A few weeks later, a massive number of players were banned for cheating.
Also, here is an interesting example about someone who reverse-engineered the Starbucks API.
Conclusion
Here is a recap of all the anti-bot techniques we saw in this article:
Anti-bot technique
Counter measure
Supported by ScrapingBee
Headless browsers
✅
IP-rate limiting
Rotating proxies
Banning Data center IPs
Residential IPs
Forge and rotate TLS fingerprints
Captchas on suspicious activity
All of the above
Systematic Captchas
Captchas-solving tools and services
❌
I hope that this overview will help you understand web-scraping and that you learned a lot reading this article.
We leverage everything I talked about in this post at ScrapingBee. Our web scraping API handles thousands of requests per second without ever being blocked. If you don’t want to lose too much time setting everything up, make sure to try ScrapingBee. The first 1k API calls are on us :).
We recently published a guide about the best web scraping tools on the market, don’t hesitate to take a look!