Prevent Web Scraping – A Step by Step Guide – DataDome
Who uses web scraper bots, and why?
Your content is gold, and it’s the reason visitors come to your website. Threat actors also want your gold, and use scraper bot attacks to gather and exploit your web content—to republish content with no overhead, or to undercut your prices automatically, for example.
Online retailers often hire professional web scrapers or use web scraping tools to gather competitive intelligence to craft future retail pricing strategies and product catalogs.
Threat actors try their best to disguise their bad web scraping bots as good ones, such as the ubiquitous Googlebots. DataDome identifies over 1 million hits per day from fake Googlebots on all customer websites.
Read more: TheFork (TripAdvisor) blocks scraping on its applications
The anatomy of a scraping attack
Scraping attacks consist of three main phases:
Identify target URLs and parameter values: Web scrapers identify their targets and prepare to evade detection by creating fake user accounts, masking their malicious scraper bots as good ones, obfuscating their source IP addresses, and more.
Run scraping tools & processes: The army of scraper bots runs against the target website, mobile app, or API. The often intense level of bot traffic can overload servers and result in poor website performance or even downtime.
Extract content and data: Web scrapers extract proprietary content and database records from the target and store them in their own databases for later analysis and abuse.
Figure 1: OAT-011 indicative diagram. Source: OWASP.
Common protection strategies against web scraping
Common anti-crawler protection strategies include:
Monitoring new or existing user accounts with high levels of activity and no purchases
Detecting abnormally high volumes of product views as a sign of non-human activity
Tracking the activity of competitors for signs of price and product catalog matching
Enforcing site terms and conditions that stop malicious web scraping
Employing bot protection capabilities with deep behavioral analysis to pinpoint bad bots and prevent web scraping
Site owners commonly use robots.txt files to communicate their intentions when it comes to scraping. A robots.txt file permits well-behaved bots to traverse specific pages; however, malicious bots don't care about robots.txt, which serves only as a "no trespassing" sign.
A clear, binding terms of use agreement that dictates permitted and non-permitted activity can potentially help in litigation. Check out our terms and conditions template for precise, enforceable anti-scraping wording.
Scrapers will do everything in their power to disguise scraping bots as genuine users. Their ability to scrape publicly available content, register fake user accounts for malicious bots, and pass valid HTTP requests from randomly generated device IDs and IP addresses renders traditional rule-based security measures, such as WAFs, ineffective against sophisticated scraping attacks.
How DataDome protects against website and content scraping
A good bot detection solution will be able to identify visitor behavior that shows signs of web scraping in real time, and automatically block malicious bots before scraping attacks unfold, while maintaining a smooth experience for real human users. To correctly identify fraudulent traffic and block web scraping tools, a bot protection solution must be able to analyze both technical and behavioral data.
“Bots were scraping our website in order to steal our content and then sell it to third parties. Since we’ve activated the [DataDome bot] protection, web scraper bots are blocked and cannot access the website. Our data are secured and no longer accessible to bots. We are also now able to monitor technical logs in order to detect abnormal behaviors such as aggressive IP addresses or unusual queries. ”
Head of Technical Dept., Enterprise (1001-5000 employees)
DataDome employs a two-layer bot detection engine to help CTOs and CISOs protect their websites, mobile apps, and APIs from malicious scraping bots & block web scraping tools. It compares every site hit with a massive in-memory pattern database, and uses a blend of AI and machine learning to decide in less than 2 milliseconds whether to grant access to your pages or not.
DataDome is the only bot protection solution delivered as-a-service. It deploys in minutes on any web architecture, is unmatched in brute force attack detection speed and accuracy, and runs on autopilot. You will receive real-time notifications whenever your site is under scraping attack, but no intervention is required. Once you have set up a whitelist of trusted partner bots, DataDome will take care of all unwanted traffic and stop malicious bots from crawling your site in order to prevent website scraping.
Want to see if scraper bots are on your site? You can test your site today. (It's easy & free.)
How do I prevent site scraping? [closed] – Stack Overflow
Note: Since the complete version of this answer exceeds Stack Overflow’s length limit, you’ll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work, and, by extension, what prevents them from working well.
There are various types of scrapers, and each works differently:
Spiders, such as Google's bot or website copiers like HTTrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex-based ones, these work by extracting data from pages based on patterns in the HTML, usually ignoring everything else.
For example: if your website has a search feature, such a scraper might submit a search request and then extract all the result links and their titles from the results-page HTML. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using a HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Webscraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use.
Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.
Embedding your website in other sites' pages with frames, and embedding your site in mobile apps.
While not technically scraping, mobile apps (Android and iOS) can embed websites, and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy-paste: People will copy and paste your content in order to use it elsewhere.
There is a lot of overlap between these different kinds of scrapers, and many scrapers will behave similarly, even if they use different technologies and methods.
These tips are mostly my own ideas, drawn from various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the interwebs.
How to stop scraping
You can’t completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Monitor your logs & traffic patterns; limit access if you see unusual activity:
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas:
Rate limiting:
Only allow users (and scrapers) to perform a limited number of actions in a certain time window – for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers and make them ineffective. You could also show a captcha if actions are completed too fast, or faster than a real user could.
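As a rough sketch of what per-IP rate limiting can look like (the answer doesn't prescribe a stack; this assumes Python with Flask, in-memory counters, and made-up limits that you would tune for your own traffic):

```python
# Minimal per-IP rate limiter for a Flask app (illustrative only; a real
# deployment would use Redis or an API gateway instead of process memory).
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 10         # look at the last 10 seconds of activity
MAX_REQUESTS = 20           # allow at most 20 requests per IP in that window
_hits = defaultdict(deque)  # ip -> deque of request timestamps

@app.before_request
def rate_limit():
    now = time.time()
    hits = _hits[request.remote_addr]
    # Drop timestamps that have fallen out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    hits.append(now)
    if len(hits) > MAX_REQUESTS:
        abort(429)  # 429 Too Many Requests; you could serve a captcha here instead

@app.route("/search")
def search():
    return "search results..."
```

In production you would back this with Redis or handle it at the load balancer or CDN, but the idea is the same: count recent actions per client and slow down or challenge anyone over the limit.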
Detect unusual activity:
If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
Don’t just monitor & rate limit by IP address – use other indicators too:
If you do block or rate limit, don’t just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
How fast users fill out forms, and where on a button they click;
You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.
HTTP headers and their order, especially User-Agent.
As an example, if you get many requests from a single IP address, all using the same User-Agent and screen size (determined with JavaScript), and the user (a scraper, in this case) always clicks the button in the same way and at regular intervals, it's probably a screen scraper. You can then temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address); this way you won't inconvenience real users on that IP address, eg. in the case of a shared internet connection.
You can also take this further, as you can identify similar requests, even if they come from different IP addresses, indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block. Again, be aware of not inadvertently blocking real users.
This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
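To make the "more than just the IP address" idea concrete, here is a small illustrative sketch (a Python/werkzeug-style request object is assumed; the screen-size field is something your own JavaScript would have to report, and the field names are made up) that folds several signals into a single fingerprint you can rate limit or block on:

```python
# Sketch: build a client "fingerprint" from several signals instead of the IP
# alone. The screen size is assumed to be reported by your own JavaScript
# (e.g. posted once per session); none of these names come from a real API.
import hashlib

def client_fingerprint(request, screen_size=""):
    parts = [
        request.remote_addr or "",
        request.headers.get("User-Agent", ""),
        request.headers.get("Accept-Language", ""),
        ",".join(request.headers.keys()),  # header order is itself a signal
        screen_size,                       # e.g. "1920x1080", gathered client-side
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# You would then rate limit / block on this fingerprint rather than on the raw
# IP, so one abusive client on a shared connection doesn't take everyone down.
```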
Related questions on Security Stack Exchange:
How to uniquely identify users with the same external IP address? for more details, and
Why do people use IP address bans when IP addresses often change? for info on the limits of these methods.
Instead of temporarily blocking access, use a Captcha:
The simple way to implement rate limiting would be to temporarily block access for a certain amount of time; however, using a captcha may be better – see the section on captchas further down.
Require registration & login
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.
If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.
In order to avoid scripts creating many accounts, you should:
Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration / account creation.
Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.
Block access from cloud hosting and scraping service IP addresses
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or Google App Engine, or from VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.
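A minimal sketch of the IP-range check, using Python's standard ipaddress module; the CIDR blocks below are documentation placeholders only, and in practice you would load the ranges the providers publish (AWS, for instance, publishes its ranges as ip-ranges.json):

```python
# Sketch: flag requests whose source IP falls inside known datacenter ranges.
# The CIDR blocks below are placeholders (TEST-NET ranges); populate the list
# from the cloud providers' published address ranges instead.
import ipaddress

DATACENTER_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),   # placeholder range
    ipaddress.ip_network("203.0.113.0/24"),    # placeholder range
]

def is_datacenter_ip(ip_string):
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        return False
    return any(ip in net for net in DATACENTER_NETWORKS)

# e.g. if is_datacenter_ip(request.remote_addr): show a captcha instead of content
```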
Make your error message nondescript if you do block
If you do block / limit access, you should ensure that you don’t tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:
Too many requests from your IP address, please try again later.
Error, User Agent header not present!
Instead, show a friendly error message that doesn’t tell the scraper what caused it. Something like this is much better:
Sorry, something went wrong. You can contact support should the problem persist.
This is also a lot more user-friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, so that a real user who triggers the error isn't locked out entirely and forced to contact you.
Use Captchas if you suspect that your website is being accessed by a scraper.
Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn’t a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using Captchas:
Don't roll your own, use something like Google's reCaptcha: it's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site. (A minimal server-side verification sketch follows after this list.)
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself (although quite well hidden), thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.
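As a rough illustration of the server-side half of this, here is a minimal sketch (assuming a Python backend and the third-party requests library; the secret key is a placeholder) that verifies a reCAPTCHA v2 token posted by the widget before serving the protected content:

```python
# Sketch: verify a reCAPTCHA token on the server before serving the page.
# Assumes the client-side widget posted the token as "g-recaptcha-response"
# and that the `requests` library is installed; RECAPTCHA_SECRET is a placeholder.
import requests

RECAPTCHA_SECRET = "your-secret-key-here"  # placeholder

def captcha_passed(form, client_ip):
    token = form.get("g-recaptcha-response", "")
    if not token:
        return False
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": client_ip},
        timeout=5,
    )
    return resp.json().get("success", False)
```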
Serve your text content as an image
You can render text into an image server-side and serve that to be displayed, which will hinder simple scrapers from extracting the text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It’s also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it’s also easy to circumvent with some OCR, so don’t do it.
You can do something similar with CSS sprites, but that suffers from the same problems.
Don’t expose your complete dataset:
If feasible, don't provide a way for a script or bot to get all of your dataset. As an example: you have a news site with lots of individual articles. You could make those articles accessible only by searching for them via the on-site search; if there is no list of all the articles and their URLs anywhere on the site, a script wanting to get all of them will have to search for every possible phrase which might appear in your articles in order to find them all, which will be time-consuming and horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if:
The bot / script does not want / need the full dataset anyway.
Your articles are served from a URL which contains a sequential ID (something like article.php?articleId=12345), which will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
Searching for something like “and” or “the” can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).
You need search engines to find your content.
Don’t expose your APIs, endpoints, and similar things:
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described further down.
To deter HTML parsers and scrapers:
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.
Frequently change your HTML
Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after a h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server side HTML processing, this should not be too hard.
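One illustrative way to automate the rotation (a sketch only, assuming Python-based templates and a build step that regenerates your CSS from the same mapping; the secret key and the weekly period are arbitrary choices):

```python
# Sketch: derive per-week class names from the "real" ones, so the markup
# (and the matching CSS selectors, generated from the same mapping) changes
# automatically every week. Purely illustrative.
import hashlib
import time

SECRET = b"rotate-me"  # placeholder key

def rotated_class(name, period_days=7):
    week = int(time.time() // (period_days * 86400))
    digest = hashlib.sha256(SECRET + name.encode() + str(week).encode()).hexdigest()
    # Vary the length as well, so scrapers can't just match "any 14 hex chars".
    length = 10 + (int(digest[:2], 16) % 8)
    return "c" + digest[:length]

# e.g. <div class="{{ rotated_class('article-content') }}"> in your templates,
# with the CSS generated from the same function at build time.
```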
Things to be aware of:
It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.
See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.
Change your HTML based on the user’s location
This is sort of similar to the previous tip. If you serve different HTML based on your user’s location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it’s actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
Frequently change your HTML, actively screw with the scrapers by doing so!
An example: You have a search feature on your website which returns each result as an identically structured div containing a result title, an excerpt, and a link, e.g.:
Stack Overflow has become the world's most popular programming Q & A website
The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which…
(And so on, lots more identically structured divs with search results.)
As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in place, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:
The real results stay exactly as before, but an extra, fake result is added and hidden with CSS (display: none). Its title and excerpt advertise a site that doesn't exist, followed by text along the lines of: "This search result is here to prevent scraping. If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-) Note that clicking the link below will block access to this site for 24 hours." The fake result's "I'm a scraper!" link points to your honeypot URL, e.g. /scrapertrap/.
(The actual, real search results follow.)
A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either, because you have disallowed /scrapertrap/ in your robots.txt.
You can make your honeypot page do something like block access for the IP address that visited it, or force a captcha for all subsequent requests from that IP.
Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
You can / should combine this with the previous tip of changing your HTML frequently.
Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You might also want to consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially but breaks after a while. This also applies to the previous tip.
Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
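For concreteness, a minimal sketch of the honeypot endpoint itself (Flask is assumed; the in-memory blocklist and the 24-hour ban are illustrative choices, and you could show a captcha instead of a hard 403):

```python
# Sketch of the /scrapertrap/ idea in Flask: any client that follows the
# hidden link gets its IP put on a short-lived blocklist. (In-memory storage
# and the 24-hour ban length are illustrative choices, not requirements.)
import time
from flask import Flask, abort, request

app = Flask(__name__)
_banned = {}             # ip -> timestamp when the ban expires
BAN_SECONDS = 24 * 3600

@app.before_request
def reject_banned_clients():
    if _banned.get(request.remote_addr, 0) > time.time():
        abort(403)  # or show a captcha instead of a hard block

@app.route("/scrapertrap/")
def scraper_trap():
    # Only scrapers that ignore robots.txt and the hidden CSS end up here.
    _banned[request.remote_addr] = time.time() + BAN_SECONDS
    abort(403)
```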
Serve fake and useless data if you detect a scraper
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don’t know that they’re being screwed with.
As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you’ll make it hard for scrapers to get what they want, namely the actual, real data.
Don’t accept requests if the User Agent is empty / missing
Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else.. )
It’s trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
Don’t accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers
In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:
“Mozilla” (Just that, nothing else. I’ve seen a few questions about scraping here, using that. A real browser will never use only that)
“Java 1.7.43_u43” (By default, Java’s HttpUrlConnection uses something like this.)
“BIZCO EasyScraping Studio 2.0”
“wget”, “curl”, “libcurl”, … (Wget and cURL are sometimes used for basic scraping)
If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
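A small sketch of such a check (Python; the blacklist entries are examples only, and you would grow the list from what you actually see in your own logs):

```python
# Sketch: turn away requests with no User-Agent at all, or with one that only
# scraping tools tend to use. The fragments below are examples; tune from logs.
SCRAPER_UA_FRAGMENTS = ("wget", "curl", "libcurl", "python-requests", "scrapy")

def looks_like_scraper(user_agent):
    if not user_agent:           # no User-Agent header at all
        return True
    ua = user_agent.lower()
    if ua == "mozilla":          # just "Mozilla", nothing else
        return True
    return any(fragment in ua for fragment in SCRAPER_UA_FRAGMENTS)

# e.g. if looks_like_scraper(request.headers.get("User-Agent")): serve a
# captcha, fake data, or a 403 -- whichever fits your strategy.
```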
If it doesn’t request assets (CSS, images), it’s not a real browser.
A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won’t as they are only interested in the actual pages and their content.
You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.
Use and require cookies; use them to track user and scraper actions.
You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers; however, it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate limiting, blocking, or captchas on a per-user instead of a per-IP basis.
For example: when the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.
Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.
Note that if you use JavaScript to set and retrieve the cookie, you’ll block scrapers which don’t run JavaScript, since they can’t retrieve and send the cookie with their request.
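As a sketch of the search-cookie idea (Flask assumed; the in-memory counter and the threshold of 50 result views are made-up values):

```python
# Sketch: tag each search with a random cookie and count how many result
# pages that cookie opens; opening "everything" is a strong scraper signal.
import secrets
from collections import Counter
from flask import Flask, make_response, request

app = Flask(__name__)
_views = Counter()  # search-session cookie -> number of result pages opened

@app.route("/search")
def search():
    resp = make_response("search results...")
    resp.set_cookie("search_session", secrets.token_urlsafe(16), httponly=True)
    return resp

@app.route("/article/<int:article_id>")
def article(article_id):
    session_id = request.cookies.get("search_session")
    if session_id:
        _views[session_id] += 1
        if _views[session_id] > 50:   # nobody opens every single result
            return "please solve this captcha first", 429
    return f"article {article_id}..."
```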
Use JavaScript + Ajax to load your content
You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.
Be aware of:
Using JavaScript to load the actual content will degrade user experience and performance
Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.
Obfuscate your markup, network requests from scripts, and everything else.
If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64, or something more complex), and then decode and display it on the client after fetching it via Ajax. This means that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.
If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.
You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using a HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
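A rough sketch of what this can look like server-side (Flask assumed; base64 is used purely as an example of "encoding", the token name is invented, and none of this is real security, just extra reverse-engineering work for whoever writes the scraper):

```python
# Sketch: only hand out the article body via an Ajax endpoint, lightly
# obfuscated (base64 here, as an example) and gated on a per-page token
# that the normal page embeds in its JavaScript.
import base64
import secrets
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder

@app.route("/article/<int:article_id>")
def article_page(article_id):
    session["content_token"] = secrets.token_urlsafe(16)
    # The real template would embed this token in its JS and fetch
    # /content/<id>?token=... after the page loads.
    return f"<html>... token={session['content_token']} ...</html>"

@app.route("/content/<int:article_id>")
def article_content(article_id):
    if request.args.get("token") != session.get("content_token"):
        abort(403)  # the endpoint is useless without loading the page first
    text = f"body of article {article_id}"
    return base64.b64encode(text.encode("utf-8"))  # client JS decodes with atob()
```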
There are several disadvantages to doing something like this, though:
It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don’t run JavaScript though)
It will make your site nonfunctional for real users if they have JavaScript disabled.
Performance and page-load times will suffer.
Non-Technical:
Tell people not to scrape, and some will respect it
Find a lawyer
Make your data available, provide an API:
You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.
Miscellaneous:
There are also commercial scraping-protection services, such as the anti-scraping offerings from Cloudflare or Distil Networks (details on how it works here), which do these things, and more, for you.
Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, find compromises.
Don’t forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
Scrapers can scrape other scrapers: If there’s one website which has content scraped from yours, other scrapers can scrape from that scraper’s website.
Further reading:
Wikipedia’s article on Web scraping. Many details on the technologies involved and the different types of web scraper.
Stopping scripters from slamming your website hundreds of times a second. Q & A on a very similar problem – bots checking a website and buying things as soon as they go on sale. A lot of relevant info, esp. on Captchas and rate-limiting.
Preventing Web Scraping: Best Practices for Keeping Your …
Many content producers or site owners get understandably anxious about the thought of a web scraper culling all of their data, and wonder if there's any technical means for stopping automated harvesting. Unfortunately, if your website presents information in a way that a browser can access and render for the average visitor, then that same content can be scraped by a script or application.
Any content that can be viewed on a webpage can be scraped. Period.
You can try checking the headers of the requests – like User-Agent or Cookie – but those are so easily spoofed that it’s not even worth doing.
You can see if the client executes Javascript, but bots can run that as well. Any behavior that a browser makes can be copied by a determined and skilled web scraper.
But while it may be impossible to completely prevent your content from being lifted, there are still many things you can do to make the life of a web scraper difficult enough that they'll give up, or not even attempt your site at all.
Having written a book on web scraping and spent a lot of time thinking about these things, here are a few things I’ve found that a site owner can do to throw major obstacles in the way of a scraper.
1. Rate Limit Individual IP Addresses
If you’re receiving thousands of requests from a single computer, there’s a good chance that the person behind it is making automated requests to your site.
Blocking requests from computers that are making them too fast is usually one of the first measures sites will employ to stop web scrapers.
Keep in mind that some proxy services, VPNs, and corporate networks present all outbound traffic as coming from the same IP address, so you might inadvertently block lots of legitimate users who all happen to be connecting through the same machine.
If a scraper has enough resources, they can circumvent this sort of protection by setting up multiple machines to run their scraper on, so that only a few requests are coming from any one machine.
Alternatively, if time allows, they may just slow their scraper down so that it waits between requests and appears to be just another user clicking links every few seconds.
2. Require a Login for Access
HTTP is an inherently stateless protocol, meaning that no information is preserved from one request to the next, although most HTTP clients (like browsers) will store things like session cookies.
This means that a scraper doesn’t usually need to identify itself if it is accessing a page on a public website. But if that page is protected by a login, then the scraper has to send some identifying information along with each request (the session cookie) in order to view the content, which can then be traced back to see who is doing the scraping.
This won’t stop the scraping, but will at least give you some insight into who’s performing automated access to your content.
3. Change Your Website’s HTML Regularly
Scrapers rely on finding patterns in a site’s HTML markup, and they then use those patterns as clues to help their scripts find the right data in your site’s HTML soup.
If your site’s markup changes frequently or is thoroughly inconsistent, then you might be able to frustrate the scraper enough that they give up.
This doesn't mean you need a full-blown website redesign; simply changing the class and id attributes in your HTML (and the corresponding CSS files) should be enough to break most scrapers.
Note that you might also end up driving your web designers insane as well.
4. Embed Information Inside Media Objects
Most web scrapers assume that they’ll simply be pulling a string of text out of an HTML file.
If the content on your website is inside an image, movie, pdf, or other non-text format, then you’ve just added another very huge step for a scraper – parsing text from a media object.
Note that this might make your site slower to load for the average user, way less accessible for blind or otherwise disabled users, and make it a pain to update content.
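As a sketch of what "text inside an image" means in practice (Python with the third-party Pillow library is assumed; the sizes and the example snippet are arbitrary):

```python
# Sketch: render a sensitive snippet (say, a price) as a PNG instead of text,
# using Pillow. Keep the accessibility and SEO costs described above in mind.
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont

def text_as_png(text):
    font = ImageFont.load_default()
    image = Image.new("RGB", (10 + 7 * len(text), 24), "white")
    ImageDraw.Draw(image).text((5, 5), text, fill="black", font=font)
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()  # serve with Content-Type: image/png

# e.g. in a view: return text_as_png("$19.99"), 200, {"Content-Type": "image/png"}
```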
5. Use CAPTCHAs When Necessary
CAPTCHAs are specifically designed to separate humans from computers by presenting problems that humans generally find easy, but computers have a difficult time with.
While humans tend to find the problems easy, they also tend to find them extremely annoying. CAPTCHAs can be useful, but should be used sparingly.
Maybe only show a CAPTCHA if a particular client has made dozens of requests in the past few seconds.
6. Create “Honey Pot” Pages
Honey pots are pages that a human visitor would never visit, but a robot that’s clicking every link on a page might accidentally stumble across. Maybe the link is set to display:none in CSS or disguised to blend in with the page’s background.
Honey pots are designed more for web crawlers – that is, bots that don’t know all of the URLs they’re going to visit ahead of time, and must simply click all the links on a site to traverse its content.
Once a particular client visits a honey pot page, you can be relatively sure they’re not a human visitor, and start throttling or blocking all requests from that client.
7. Don’t Post the Information on Your Website
This might seem obvious, but it’s definitely an option if you’re really worried about scrapers stealing your information.
Ultimately, web scraping is just a way to automate access to a given website. If you’re fine sharing your content with anyone who visits your site, then maybe you don’t need to worry about web scrapers.
After all, Google is the largest scraper in the world and people don’t seem to mind when Google indexes their content. But if you’re worried about it “falling into the wrong hands” then maybe it shouldn’t be up there in the first place.
Any steps that you take to limit web scrapers will probably also harm the experience of the average web viewer. If you're posting information on your website for the public to view, then you probably want to allow fast and easy access to it.
This is not only convenient for your visitors, it’s great for web scrapers as well.
Learn More About How Scrapers Work
I've written a book called The Ultimate Guide to Web Scraping that includes everything you need to know to extract information from web pages.
This article is an expanded version of my answer on Quora to the question: What is the best way to block someone from scraping content from your website?
If you'd like to learn more, check out my book or reach out and say hi!