Web Scraping Best Practices

November 16, 2021

How to scrape websites without getting blocked - ScrapeHero


Web scraping has to be performed responsibly so that it does not have a detrimental effect on the sites being scraped. Web crawlers can retrieve data much more quickly and in greater depth than humans, so bad scraping practices can hurt the performance of the site. While most websites may not have anti-scraping mechanisms, some sites use measures that can get web scrapers blocked, because they do not believe in open data access.
If a crawler performs multiple requests per second and downloads large files, an under-powered server would have a hard time keeping up with requests from multiple crawlers. Since web crawlers, scrapers or spiders (words used interchangeably) don’t really drive human website traffic and seemingly affect the performance of the site, some site administrators do not like spiders and try to block their access.
In this article, we will talk about the best web scraping practices to follow to scrape websites without getting blocked by the anti-scraping or bot detection tools.
Web Scraping best practices to follow to scrape without getting blocked
Respect Robots.txt
Make the crawling slower, do not slam the server, treat websites nicely
Do not follow the same crawling pattern
Make requests through Proxies and rotate them as needed
Rotate User Agents and corresponding HTTP Request Headers between requests
Use a headless browser like Puppeteer, Selenium or Playwright
Beware of Honey Pot Traps
Check if Website is Changing Layouts
Avoid scraping data behind a login
Use Captcha Solving Services
How can websites detect web scraping?
How do you find out if a website has blocked or banned you?
Basic Rule: “Be Nice”
An overarching rule to keep in mind for any kind of web scraping is
BE GOOD AND FOLLOW A WEBSITE’S CRAWLING POLICIES
Here are the web scraping best practices you can follow to avoid getting web scraping blocked:
Respect Robots.txt
Web spiders should ideally follow the robots.txt file for a website while scraping. It has specific rules for good behavior, such as how frequently you can scrape, which pages allow scraping, and which ones don’t. Some websites allow Google to scrape their websites while not allowing any other websites to scrape. This goes against the open nature of the Internet and may not seem fair, but the owners of the website are within their rights to resort to such behavior.
You can find the robots.txt file on websites. It is usually in the root directory of a website.
If it contains lines like the ones shown below, it means the site does not want to be scraped.
User-agent: *
Disallow: /
However, since most sites want to be on Google, arguably the largest scraper of websites globally, they do allow access to bots and spiders.
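As a minimal sketch of how a crawler might check these rules before fetching a page, the example below uses Python's standard `urllib.robotparser`. The crawler name `my-crawler`, the `example.com` URLs and the second rule set are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt_text):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


# The "block everything" rules shown in the text.
block_all = load_rules("User-agent: *\nDisallow: /")

# A hypothetical, more permissive file with a crawl delay.
partial = load_rules("User-agent: *\nCrawl-delay: 10\nDisallow: /private/")

print(block_all.can_fetch("my-crawler", "https://example.com/products"))
print(partial.can_fetch("my-crawler", "https://example.com/public/page"))
print(partial.crawl_delay("my-crawler"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.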
What if you need some data that is forbidden by robots.txt? You could still go and scrape it. Most anti-scraping tools block web scraping when you are scraping pages that are not allowed by robots.txt.
What do these tools look for? Whether the client is a bot or a real user. And how do they find that out? By looking for a few indicators of things that real users do and bots don’t. Humans are random, bots are not. Humans are not predictable, bots are.
Here are a few easy giveaways that you are a bot, scraper or crawler:
scraping too fast and too many pages, faster than a human ever can
following the same pattern while crawling. For example, going through all pages of search results, and going to each result only after grabbing links to them. No human ever does that.
too many requests from the same IP address in a very short time
not identifying as a popular browser. You can identify as one by specifying a ‘User-Agent’.
using a user agent string of a very old browser
The points below should get you past most of the basic to intermediate anti-scraping mechanisms used by websites to block web scraping.
Make the crawling slower, do not slam the server, treat websites nicely
Web scraping bots fetch data very fast, but it is easy for a site to detect your scraper, as humans cannot browse that fast. The faster you crawl, the worse it is for everyone. If a website gets more requests than it can handle, it might become unresponsive.
Make your spider look real by mimicking human actions. Put some random programmatic sleep calls in between requests, add some delays after crawling a small number of pages, and choose the lowest number of concurrent requests possible. Ideally, put a delay of 10-20 seconds between clicks and do not put much load on the website; treat the website nicely.
Use auto-throttling mechanisms, which automatically throttle the crawling speed based on the load on both the spider and the website that you are crawling. Adjust the spider to an optimum crawling speed after a few trial runs. Do this periodically, because the environment does change over time.
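The random sleep calls described above can be sketched as follows. This is only an illustration: `fetch` stands for whatever function you use to download a page, and the 10-20 second defaults simply mirror the range suggested in the text.

```python
import random
import time


def polite_delay(min_seconds=10.0, max_seconds=20.0):
    """Pick a random pause length (in seconds) between two requests."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, min_seconds=10.0, max_seconds=20.0, sleep=time.sleep):
    """Fetch each URL in turn, sleeping a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random one before every other.
            sleep(polite_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

The `sleep` parameter is there only so the pause can be replaced in tests; in normal use the default `time.sleep` is what you want.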
Do not follow the same crawling pattern
Humans browse a site with random actions and generally do not repeat the same task over and over. Web scraping bots tend to follow the same crawling pattern, because that is how they are programmed unless told otherwise. Sites that have intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions, which can get the scraper blocked.
Incorporate some random clicks on the page, mouse movements and random actions that will make a spider look like a human.
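One small piece of this, varying the order in which pages are visited, can be sketched without a browser. The helper name `randomized_order` is made up for this example; random clicks and mouse movements would need a browser automation library and are not shown here.

```python
import random


def randomized_order(urls, seed=None):
    """Return the URLs in a shuffled order, leaving the input list untouched."""
    rng = random.Random(seed)
    shuffled = list(urls)
    rng.shuffle(shuffled)
    return shuffled


pages = [f"https://example.com/results?page={n}" for n in range(1, 6)]
for url in randomized_order(pages):
    print(url)  # visit pages in a different order on every run
```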
Make requests through Proxies and rotate them as needed
When scraping, your IP address can be seen. A site can tell what you are doing and whether you are collecting data. It can also collect data such as usage patterns or whether a visitor is a first-time user.
Multiple requests coming from the same IP will get you blocked, which is why we need to use multiple addresses. When we send requests through a proxy machine, the target website does not know the original IP, which makes detection harder.
Create a pool of IPs and use a random one for each request. Along with this, spread your requests across multiple IPs so that each IP sends only a handful.
There are several methods that can be used to change your outgoing IP.
TOR
VPNs
Free Proxies
Shared Proxies – the least expensive proxies, shared by many users. Chances to get blocked are high.
Private Proxies – usually used only by you, and lower chances of getting blocked if you keep the frequency low.
Data Center Proxies, if you need a large number of IP addresses, faster proxies and larger pools of IPs. They are cheaper than residential proxies and could be detected easily.
Residential Proxies, if you are making a huge number of requests to websites that block actively. These are very expensive (and could be slower, as they are real devices). Try everything else before getting a residential proxy.
In addition, various commercial providers also provide services for automatic IP rotation. A lot of companies now provide residential IPs to make scraping even easier – but most are expensive.
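A pool of proxies with a random pick per request might look like the sketch below, which uses only the standard library. The proxy addresses are placeholders from a documentation-only IP range, and the function names are invented for this example.

```python
import random
import urllib.request

# Hypothetical proxy addresses; replace them with proxies you actually control or rent.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def pick_proxy(pool=PROXY_POOL):
    """Choose a random proxy and return it as a scheme-to-proxy mapping."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}


def fetch_via_random_proxy(url, pool=PROXY_POOL, timeout=30):
    """Send one GET request through a randomly chosen proxy from the pool."""
    handler = urllib.request.ProxyHandler(pick_proxy(pool))
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

The mapping returned by `pick_proxy` has the same shape that the `requests` library accepts in its `proxies=` argument, so the selection logic carries over if you use `requests` instead.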
Learn More:
How To Rotate Proxies and IP Addresses using Python 3
How to make anonymous requests using TorRequests and Python
Rotate User Agents and corresponding HTTP Request Headers between requests
A User-Agent is a request header that tells the server which web browser is being used. If the User-Agent is not set, some websites won’t let you view content. Every request made from a web browser contains a User-Agent header, and using the same User-Agent consistently leads to the detection of a bot. You can get your User-Agent by typing ‘what is my user agent’ in Google’s search bar. To make your scraper appear more real and bypass detection, you have to fake the User-Agent. Most web scrapers do not send a browser-like User-Agent by default, so you need to add that yourself.
You could even pretend to be the Google Bot: Googlebot/2.1 if you want to have some fun!
Now, just sending User-Agents alone would get you past most basic bot detection scripts and tools. If you find your bots getting blocked even after putting in a recent User-Agent string, you should add some more request headers.
Most browsers send more headers to websites than just the User-Agent. For example, here is a set of headers a browser sent to our web scraping test site. It would be ideal to send these common request headers too.
The most basic ones are:
User-Agent
Accept
Accept-Language
Referer
DNT
Upgrade-Insecure-Requests
Cache-Control
Do not send cookies unless your scraper depends on Cookies for functionality.
You can find the right values for these by inspecting your web traffic using Chrome Developer Tools, or a tool like MitmProxy or Wireshark. You can also copy a request from them as a curl command. For example:
curl '' \
-H 'authority: ' \
-H 'dnt: 1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
-H 'sec-fetch-site: none' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-user: ?1' \
-H 'sec-fetch-dest: document' \
-H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
--compressed
You can get this converted to any language using a curl conversion tool. Here is how this was converted to Python:
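import requests

# Header names and values are the ones from the curl command above.
headers = {
    'authority': '',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

# The target URL is left empty in the source; put the page you are requesting here.
response = requests.get('', headers=headers)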
You can create similar header combinations for multiple browsers and start rotating those headers between each request to reduce the chances of getting your web scraping blocked.
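A rough sketch of such rotation is shown below, using only the standard library. The first profile reuses the Chrome values from the curl example; the Firefox profile is an illustrative assumption, and `build_request` is a name invented for this example. Each profile is kept whole so that the User-Agent always travels with headers that match it.

```python
import random
import urllib.request

HEADER_PROFILES = [
    {   # Chrome on macOS, values from the curl example in the text
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,"
                  "image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
    },
    {   # Firefox on Windows, illustrative values (an assumption, not from the text)
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) "
                      "Gecko/20100101 Firefox/78.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
    },
]


def build_request(url, profiles=HEADER_PROFILES):
    """Build a request that carries one randomly chosen, internally consistent header set."""
    headers = dict(random.choice(profiles))
    return urllib.request.Request(url, headers=headers)
```

The resulting object can be passed to `urllib.request.urlopen`; with `requests`, the chosen dictionary can be passed directly as `headers=`.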
Use a headless browser like Puppeteer, Selenium or Playwright
If none of the methods above works, the website is probably checking whether you are a REAL browser.
The simplest check is whether the client (web browser) can render a block of JavaScript. If it can’t, the site pretty much flags the visitor as a bot. While it is possible to block JavaScript from running in the browser, most Internet sites would be unusable in such a scenario, and as a result most browsers have JavaScript enabled.
Once this happens, a real browser is necessary in most cases to scrape the data. There are libraries to automatically control a browser, such as:
Selenium
Puppeteer and Pyppeteer
Playwright
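As a minimal sketch of this approach, the function below uses Playwright's synchronous Python API to load a page in headless Chromium and return the HTML after JavaScript has run. It assumes Playwright and its browser binaries are installed; the function name and the 30-second timeout are choices made for this example.

```python
def fetch_rendered_html(url, timeout_ms=30000):
    """Load a page in headless Chromium and return its HTML after scripts have run."""
    # Imported here so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so script-rendered content is present.
            page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```

Selenium and Pyppeteer follow the same pattern of launching a browser, navigating, and reading the rendered page.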
Anti-scraping tools are smart and are getting smarter daily, as bots feed a lot of data to the AIs used to detect them. Most advanced bot mitigation services use browser-side fingerprinting (client-side bot detection), which relies on more advanced methods than just checking whether you can execute JavaScript.
Bot detection tools look for any flags that can tell them that the browser is being controlled through an automation library.
Presence of bot specific signatures
Support for nonstandard browser features
Presence of common automation tools such as Selenium, Puppeteer, Playwright, etc.
Human-generated events such as randomized Mouse Movement, Clicks, Scrolls, Tab Changes etc.
All this information is combined to construct a unique client-side fingerprint that can tag one as bot or human.
Here are a few workarounds or tools which could help keep your headless browser-based scrapers from getting banned.
Beware of Honey Pot Traps
Honeypots are systems set up to lure hackers and detect any hacking attempts that try to gain information. A honeypot is usually an application that imitates the behavior of a real system. Some websites install honeypots in the form of links that are invisible to normal users but can be seen by web scrapers.
When following links, always take care that the link has proper visibility and no nofollow tag. Some honeypot links meant to detect spiders will have the CSS style display:none or will be color-disguised to blend in with the page’s background color.
This detection is not easy and requires a significant amount of programming work to accomplish properly. As a result, this technique is not widely used on either side: the server side or the bot or scraper side.
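A partial version of this check can be sketched with the standard-library HTML parser. It only looks at inline `style` attributes and the `rel` attribute; links hidden through external stylesheets or color tricks, which the text also mentions, would need a real browser to detect. The class and function names are invented for this example.

```python
from html.parser import HTMLParser


class VisibleLinkCollector(HTMLParser):
    """Collect hrefs of links that are not hidden inline and not marked nofollow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        style = attr.get("style", "").replace(" ", "").lower()
        rel = attr.get("rel", "").lower().split()
        hidden = "display:none" in style or "visibility:hidden" in style
        if attr.get("href") and not hidden and "nofollow" not in rel:
            self.links.append(attr["href"])


def visible_links(html):
    """Return the hrefs a cautious crawler could follow from this HTML."""
    collector = VisibleLinkCollector()
    collector.feed(html)
    return collector.links
```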
Check if Website is Changing Layouts
Some websites make it tricky for scrapers by serving slightly different layouts.
For example, pages 1-20 of a website may display one layout, and the rest of the pages may display something else. To handle this, check whether your XPaths or CSS selectors are actually returning data. If not, check how the layout is different and add a condition in your code to scrape those pages differently.
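One way to structure that condition is to keep one extractor per known layout and try them in order. The sketch below uses regular expressions on made-up markup purely for illustration; in a real scraper each extractor would wrap your XPath or CSS selector.

```python
import re


def extract_price_layout_a(html):
    """Extractor for a hypothetical layout that puts the price in a span."""
    match = re.search(r'<span class="price">([^<]+)</span>', html)
    return match.group(1).strip() if match else None


def extract_price_layout_b(html):
    """Extractor for a hypothetical alternative layout that uses a div."""
    match = re.search(r'<div id="cost">([^<]+)</div>', html)
    return match.group(1).strip() if match else None


def extract_first(html, extractors):
    """Try each extractor in order; return the first non-empty result, else None."""
    for extractor in extractors:
        value = extractor(html)
        if value:
            return value
    return None  # unknown layout: log the page and inspect it by hand
```

A `None` result is the signal that the site is serving a layout the scraper does not know yet.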
Avoid scraping data behind a login
Login is basically permission to get access to web pages. Some websites, like Indeed and Facebook, do not give that permission to scrapers.
If a page is protected by login, the scraper would have to send some information or cookies along with each request to view the page. This makes it easy for the target website to see requests coming from the same address. They could take away your credentials or block your account, which can in turn lead to your web scraping efforts being blocked.
It’s generally preferable to avoid scraping websites that require a login, as you will get blocked easily. If authentication is required, one thing you can do is imitate a human using a browser so that you can get the target data you need.
Use Captcha Solving Services
Many websites use anti-web-scraping measures. If you are scraping a website on a large scale, the website will eventually block you. You will start seeing captcha pages instead of web pages. There are services to get past these restrictions, such as 2Captcha or Anticaptcha.
If you need to scrape websites that use Captcha, it is better to resort to captcha services. Captcha services are relatively cheap, which is useful when performing large scale scrapes.
How can websites detect and block web scraping?
Websites can use different mechanisms to detect a scraper/spider from a normal user. Some of these methods are enumerated below:
Unusual traffic or a high download rate, especially from a single client or IP address within a short time span.
Repetitive tasks performed on the website in the same browsing pattern, based on an assumption that a human user won’t perform the same repetitive tasks all the time.
Checking if you are a real browser. A simple check is to try and execute JavaScript. Smarter tools can go a lot further and check your graphics card and CPU to make sure you are coming from a real browser.
Detection through honeypots – these honeypots are usually links which aren’t visible to a normal user but only to a spider. When a scraper/spider tries to access the link, the alarms are tripped.
How to address this detection and avoid web scraping getting blocked?
Spend some time upfront investigating the anti-scraping mechanisms used by a site and build the spider accordingly. It will provide a better outcome in the long run and increase the longevity and robustness of your work.
How do you find out if a website has blocked or banned you?
If any of the following signs appear on the site that you are crawling, it usually means you have been blocked or banned.
CAPTCHA pages
Unusual content delivery delays
Frequent responses with HTTP 404, 301 or 50x errors
Frequent appearance of these HTTP status codes is also an indication of blocking:
301 Moved Permanently
401 Unauthorized
403 Forbidden
404 Not Found
408 Request Timeout
429 Too Many Requests
503 Service Unavailable
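These signals can be turned into a rough automatic check. The sketch below flags a response if its status code is in the list above or its body contains a phrase typical of block pages. The phrase list and function names are assumptions for illustration, and since any one of these codes can be perfectly legitimate, the rate over many responses is more telling than a single hit.

```python
BLOCK_STATUS_CODES = {301, 401, 403, 404, 408, 429, 503}

# Phrases that commonly appear on block pages; extend with what you actually observe.
BLOCK_PHRASES = (
    "captcha",
    "access denied",
    "verify you are a human",
    "pardon our interruption",
)


def looks_blocked(status_code, body=""):
    """Return True if a single response shows one of the blocking signals."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    text = body.lower()
    return any(phrase in text for phrase in BLOCK_PHRASES)


def block_rate(responses):
    """Fraction of (status_code, body) pairs that look blocked."""
    responses = list(responses)
    if not responses:
        return 0.0
    blocked = sum(1 for status, body in responses if looks_blocked(status, body))
    return blocked / len(responses)
```

A crawler could slow down or switch proxies once `block_rate` over its recent responses passes a threshold of your choosing.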
Here is what Amazon tells you when you are blocked.
To discuss automated access to Amazon data please contact
For information about migrating to our APIs refer to our Marketplace APIs at or our Product Advertising API at for advertising use cases.
Sorry! Something went wrong!
(This is shown along with pictures of Amazon’s cute dogs.)
You may also see responses or messages like these from some popular anti-scraping tools.
We want to make sure it is actually you that we are dealing with and not a robot
Please check the box below to access the site

Why is this verification required? Something about the behaviour of the browser has caught our attention.
There are various possible explanations for this:
you are browsing and clicking at a speed much faster than expected of a human being
something is preventing Javascript from working on your computer
there is a robot on the same network (IP address) as you
Having problems accessing the site? Contact Support
Authenticate your robot
or
Please verify you are a human

Access to this page has been denied because we believe you are using automation tools to browse the website
This may happen as a result of the following:
Javascript is disabled or blocked by an extension (ad blockers for example)
Your browser does not support cookies
Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking them from loading
Pardon our interruption
As you were browsing, something about your browser made us think you were a bot. There are a few reasons this might happen
You’re a power user moving through this website with super-human speed
You’ve disabled JavaScript in your web browser
A third-party browser plugin, such as Ghostery or NoScript, is preventing Javascript from running. Additional information is available in this support article.
After completing the CAPTCHA below, you will immediately regain access to
Error 1005 Ray ID: •
Access denied
What happened?
The owner of this website () has banned the autonomous system number (ASN) your IP address is in () from accessing this website.
A comprehensive list of HTTP return codes (successes and failures) can be found here. It will be worth your time to read through these codes and be familiar with them.
Summary
All these ideas above provide a starting point for you to build your own solutions or refine your existing solution. If you have any ideas or suggestions, please join the discussion in the comments section.
Thank you for reading.
Web Scraping: Introduction, Best Practices & Caveats - Medium


Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:
Using web scraping, Marketing & Sales companies can fetch lead-related information.
Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.
The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer to build highly scalable scrapers. We will cover the following items:
Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy.
Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed as malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow.
To start, we will discuss how to scrape a page and the different libraries available in Python. Python is the most popular language for scraping.
1. Requests, an HTTP Library in Python: To scrape the website or a page, first fetch the content of the HTML page in an HTTP response object. The requests library from Python is pretty handy and easy to use. It uses urllib3 inside. I like ‘requests’ as it’s easy and the code becomes readable too.
2. BeautifulSoup: Once you get the webpage, the next step is to extract the data.
BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use BeautifulSoup to parse that page. With this approach, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.
3. Python Scrapy Framework: Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous library that performs the requests in an async fashion to boost the spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework to write scrapers, as opposed to BeautifulSoup, which is just a library to parse HTML pages.
Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website. Now let’s write a custom spider to parse a website. That’s it. Your first custom spider is created. Now, let’s understand the code:
name: Name of the spider. In this case, it’s “blogspider”.
start_urls: A list of URLs where the spider will begin to crawl.
parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to parse(..).
When you run this, Scrapy will look for the start URL and will give you all the divs of the selected class and extract the associated text from them. Alternatively, you can write your extraction logic in a parse method or create a separate class for extraction and call its object from the parse method.
We’ve seen how to extract simple items from a website using Scrapy, but this is just the surface.
Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor, by which you can instruct Scrapy to extract links from a web page.
4. Python lxml library: This is another library from Python just like BeautifulSoup. Scrapy internally uses lxml. It comes with a list of APIs you can use for data extraction. Why would you use this when Scrapy itself can extract the data? Let’s say you want to iterate over the ‘div’ tag and perform some operation on each tag present under “div”; then you can use this library, which will give you a list of ‘div’ tags. Now you can simply iterate over them using the iter() function and traverse each child tag inside the parent div tag. Such traversing operations are difficult in scraping. Here is the documentation for this library.
Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:
1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are the must-have features for a data warehouse. If your data warehouse is not stable or accessible, then operations like search and filter over data would be an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS). You can use RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of the backup of data. It automatically takes a snapshot of the database. It gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.
2. Pattern Changes: Scraping heavily relies on the user interface and its structure, i.e., CSS and XPath. Now, if the target website gets some adjustments, then our scraper may crash completely or it can give random data that we don’t want.
This is a common scenario, and that’s why it’s more difficult to maintain scrapers than to write them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track if the target website has changed or not.
3. Anti-scraping Technologies: Web scraping is a common thing these days, and every website host would want to prevent their data from being scraped. Anti-scraping technologies help them with this. For example, if you are hitting a particular website from the same IP address at a regular interval, then the target website can block your IP. Adding a captcha on a website also helps. There are methods by which we can bypass these anti-scraping methods. For example, we can use proxy servers to hide our original IP. There are several proxy services that keep rotating the IP before each request. Also, it is easy to add support for proxy servers in the code, and in Python, the Scrapy framework does support it.
4. JavaScript-based dynamic content: Websites that heavily rely on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries will only work on or extract what they find in the HTML document. Ajax calls or JavaScript are executed at runtime, so they can’t scrape that. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.
5. Honeypot traps: Some websites have honeypot traps on the webpages for the detection of web crawlers. They are hard to detect, as most of the links are blended with the background color or have the CSS display property set to none. Achieving this requires large coding efforts on both the server and the crawler side, hence this method is not frequently used.
6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important to not just scrape the data, but verify its integrity as well. Doing this in real time is not always possible, so I would prefer to write test cases for the extraction logic to make sure whatever your spiders are extracting is correct and they are not scraping any bad data.
7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains, and the longer it takes to scrape that site. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that isn’t often the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case then? Well, one solution could be to design your spiders carefully. If you’re using a Scrapy-like framework, then apply proper LinkExtractor rules so that the spider will not waste time on scraping unrelated URLs.
You may use multithreading scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation for large websites.
8. Captchas: Captchas are a good way of keeping crawlers away from a website, and they are used by many website hosts. So, in order to scrape the data from such websites, we need a mechanism to solve the captchas. There are packages and software that can solve the captcha and act as a middleware between the target website and your spider.
Also, you may use libraries like Pillow and Tesseract in Python to solve the simple image-based captchas.
9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We need the maximum amount of data that is present on the Internet, and that may mean scraping millions of websites. Now, you can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of the latest technologies, like AWS ECS and Kubernetes, to run our scraper containers. This helps us keep our scrapers in a high-availability state, and it’s easy to maintain. Also, we can schedule the scrapers to run at regular intervals.
Respect the robots.txt file: robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website. This file generally contains instructions for crawlers. Now, before even planning the extraction logic, you should first check this file. Usually, you can find it in the root directory of the website. This file has all the rules set on how crawlers should interact with the website. E.g., if a website has a link to download critical information, then they probably don’t want to expose that to crawlers. Another important factor is the frequency interval for crawling, which means that crawlers can only hit the website at specified intervals. If someone has asked not to crawl their website, then we better not do it, because if they catch your crawlers, it can lead to some serious legal issues.
Do not hit the servers too frequently: As I mentioned above, some websites will have the frequency interval specified for crawlers. We better use it wisely, because not every website is tested against high load. If you are hitting at a constant interval, then it creates huge traffic on the server side, and it may crash or fail to serve other requests.
This has a high impact on user experience, and users are more important than the bots. So, we should make the requests according to the interval specified in robots.txt or use a standard delay of 10 seconds. This also helps you not to get blocked by the target website.
User Agent Rotation and Spoofing: Every request consists of a User-Agent string in the header. This string helps to identify the browser you are using, its version, and the platform. If we use the same User-Agent in every request, then it’s easy for the target website to see that the requests are coming from a crawler. So, to make sure we do not face this, try to rotate the User-Agent between the requests. You can get examples of genuine User-Agent strings on the Internet very easily; try them out. If you’re using Scrapy, you can set the USER_AGENT property in your project settings.
Rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and a proxy service so that your spider won’t get blocked.
Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it’s crawling in the same pattern. Normally, we, as humans, would not follow a pattern on a particular website. So, to have your spiders run smoothly, we can introduce actions like mouse movements, clicking a random link, etc., which give the impression that your spider is a human.
Crawl during off-peak hours: Off-peak hours are suitable for bots/crawlers, as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.
Use the scraped data responsibly: We should always take responsibility for the scraped data. It is not acceptable for someone to scrape data and then republish it somewhere else.
This can be considered as breaking copyright laws and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.
Use Canonical URLs: When we scrape, we tend to scrape duplicate URLs, and hence duplicate data, which is the last thing we want to do. It may happen in a single website where we get multiple URLs having the same data. In this situation, duplicate URLs will have a canonical URL, which points to the parent or the original URL. By using it, we make sure we don’t scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default.
Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it. Don’t hide who you are. If possible, share your credentials.
We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:
Follow the target URL’s rules while scraping. Don’t make them block your spider.
Maintenance of data and spiders at scale is difficult. Use Docker/Kubernetes and public cloud providers, like AWS, to easily scale your web scraping.
Respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
This post was originally published on Velotio.
Best Practices For Web Scraping - Zyte

Don’t be a burden
Don’t violate copyright
Don’t breach GDPR
Beware of login and website terms and conditions
At Zyte (formerly Scrapinghub), we care about ensuring that our services respect the rights of websites and companies whose data we scrape.
We hear a lot that scraping is a legal grey area, but the truth is scraping itself isn’t illegal. It’s the manner in which you scrape and what you scrape that falls into the grey area.
In this article, we’ll give you a set of guidelines to follow when scraping the web so you know when you need to be cautious about the manner and type of data you scrape.
Disclaimer: We are not your lawyer, and the recommendations in this guide do not constitute legal advice. Our Head of Legal is a lawyer, but she’s not your lawyer, so none of her opinions or recommendations in this guide constitute legal advice from her to you. The commentary and recommendations outlined below are based on Zyte (formerly Scrapinghub)’s experience helping our clients (startups to Fortune 100’s) maintain legal compliance whilst scraping 7 billion web pages per month. If you want assistance with your specific situation then you should consult a lawyer.
Don’t be a burden
The first rule of scraping the web is: do not harm the website. The second rule of web crawling is: do NOT harm the website.
This means that the volume and frequency of queries you make should not burden the website’s servers or interfere with the website’s normal operations.
You can accomplish this in a number of ways:
Limit the number of concurrent requests to the same website from a single IP address.
Respect the delay that crawlers should wait between requests by following the crawl-delay directive outlined in the robots.txt file.
If possible, schedule your crawls to take place at the website’s off-peak hours, which is more respectful.
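The crawl-delay point can be sketched with Python’s standard urllib.robotparser module. This is a minimal sketch, not a full crawler: the robots.txt content, the "example-bot" user agent, the URLs, and the 10-second fallback are assumptions chosen for illustration, and the text is parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

DEFAULT_DELAY = 10.0  # assumed fallback (seconds) when no Crawl-delay is given


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_delay_for(parser, user_agent, default=DEFAULT_DELAY):
    """Return the Crawl-delay for this user agent, or the fallback if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


rules = load_rules(ROBOTS_TXT)
delay = crawl_delay_for(rules, "example-bot")
allowed = rules.can_fetch("example-bot", "https://example.com/products/1")
blocked = rules.can_fetch("example-bot", "https://example.com/private/data")
```

A crawler would sleep for delay seconds between requests to the same site and skip any URL for which can_fetch returns False.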
A crucial aspect of this rule is providing the web administrators of the websites you scrape with an easy way to contact you. At Zyte (formerly Scrapinghub) we accomplish this by making an abuse report available on our website. If you ever receive an abuse report from a website you are scraping you should either stop scraping the site or limit the scraping in order to rectify the abuse reported.
Don’t violate copyright
When scraping a website you should always consider whether the web data you are planning to extract is copyrighted.
Copyright is defined as the exclusive legal right over a physical piece of work — like an article, picture, movie, etc. It basically means, if you create it, you own it. In order to be copyrightable, the work needs to be original and tangible.
The common types of material on the web that might be copyrighted are:
Articles
Videos
Pictures
Stories
Music
Databases
As a result, copyright is very relevant to scraping because much of the data on the internet (like articles and videos) are copyrighted works.
However, there are some situations when exceptions can apply to all or part of the data enabling it to be legally scraped without infringing on the owner’s copyright.
Fair use:
Fair Use is an exception that permits limited use of copyrighted material. Typically, fair use includes categories such as criticism/parody, comment, news reporting, teaching, scholarship, and research. One example of fair use is the publishing of short snippets of articles with links, which is generally okay under the fair use exception due to the transformative and limited nature of the use.
The factors commonly used to determine if the fair use exception applies are:
the purpose and character of your use (i.e. is it transformative in some way);
the nature of the work (i.e. fact v. fiction or published v. unpublished);
the amount taken, the less you copy the better; and
the effect upon the potential market, meaning the extent to which your use may deprive the owner of income or a potential market opportunity.
Transformative use:
One factor in determining fair use is whether the usage is transformative. Instead of distributing and storing exact duplicates or lengthy portions of the crawled website, transform the content and the use of the content in some way so that you are not violating copyright.
Facts:
The facts within copyrighted material are often not covered by copyright laws, so if you limit what is being scraped to just the factual matters (i.e. names of products, prices, etc.), then it is generally acceptable to scrape.
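One way to limit a scrape to factual matters is to keep an explicit allowlist of fields and discard everything else before storing a record. The sketch below assumes hypothetical field names and a hypothetical helper, keep_factual_fields. Which fields count as facts in your jurisdiction is a judgment the code cannot make for you.

```python
# Hypothetical allowlist of factual fields; adjust to your own data model.
FACTUAL_FIELDS = ("product_name", "price", "currency")


def keep_factual_fields(record, allowed=FACTUAL_FIELDS):
    """Return a copy of the record containing only the allowlisted fields."""
    return {key: record[key] for key in allowed if key in record}


# Example scraped record mixing facts with original, possibly copyrighted text.
scraped = {
    "product_name": "Example Widget",
    "price": 19.99,
    "currency": "USD",
    "description": "Long marketing copy written by the site owner.",
    "review_text": "A customer's full review.",
}

facts = keep_factual_fields(scraped)
```

An allowlist is used here instead of a blocklist so that any new, unreviewed field is dropped by default.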
Note that different countries have different exceptions to copyright law, and you should always ensure that an exception applies within the jurisdiction within which you’re operating.
Don’t breach GDPR
The introduction of GDPR completely changes how you can scrape the personal data of EU citizens (and sometimes non-EU citizens as well). For a deeper explanation of how GDPR affects web scrapers, be sure to check out our Web Scrapers Guide to GDPR.
However, in this section, we will briefly outline the best practices when it comes to scraping personal data. Personal data is any data that can identify an individual person:
Name
Email
Phone number
Address
User name
IP address
Bank or credit card info
Medical data
Biometric data
Unless you have a “lawful reason” to scrape and store this data, you will be in breach of GDPR if any of the scraped data belongs to EU residents. In the case of web scraping, the most common lawful reasons are legitimate interest and consent.
Consent
For consent to be your lawful reason to scrape a person’s data, you need to have that person’s explicit consent to scrape, store, and use their data in the way you intended. This means that you or a 3rd party must have been in direct contact with the person and they agreed to terms that allow you to scrape their data.
An example of this would be companies like Mint, where users give Mint consent to log into their online banking accounts and retrieve their banking transactions so that they can be tracked and displayed in a more user-friendly format.
Legitimate interest
For most companies, it will be very difficult for you to demonstrate that you have a legitimate interest in scraping someone’s personal data.
In most cases, only governments, law enforcement agencies, etc. will have what would be deemed a legitimate interest in scraping the personal data of their citizens, as they will typically be scraping people’s personal data for the public good.
Beware of login and website terms and conditions
When you log in and/or explicitly agree to a website’s terms and conditions, you are entering into a contract with the website owner and thereby agreeing to their rules regarding web scraping, which can explicitly state that you aren’t allowed to scrape any data on the website.
This means that you need to carefully review the terms and conditions you are agreeing to if your spiders have to log in to scrape data, as they could stipulate that you’re not allowed to scrape their data. You should always honor the terms of any contract you enter into, including website terms and conditions and privacy policies.
Learn more about web scraping
Here at Zyte, we have been in the web scraping industry for 12 years. We have helped extract web data for more than 1,000 clients ranging from Government Agencies and Fortune 100 companies to early-stage startups and individuals. During this time we gained a tremendous amount of experience and expertise in web data extraction.
Here are some of our best resources if you want to deepen your web scraping knowledge:
Developer tools that make web scraping a breeze
Web scraping: Best practices
Enterprise web scraping: A guide to scraping at scale
Legal compliance in web scraping
The build in-house or outsource decision
What you can use web scraping for

Frequently Asked Questions about web scraping best practices

How can I get better at web scraping?

Top 7 Web Scraping Tips
#1 Respect the website and its users. Our first advice is quite a common one: respect the site you’re scraping. …
#2 Simulate human behaviour. …
#3 Detect when you’ve been blocked. …
#4 Avoid being blocked again. …
#5 Use Headless Browser. …
#6 Use the correct proxies and tools. …
#7 Build a Web Crawler.
Aug 27, 2020

Which language is best for web scraping?

Python is widely regarded as the best language for web scraping. It is an all-rounder and can handle most web crawling tasks smoothly. Beautiful Soup, one of the most widely used Python libraries for parsing HTML, makes scraping in this language an easy route to take.
Aug 9, 2017

Can you get in trouble for web scraping?

Web scraping and crawling aren’t illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. … The court granted the injunction because users had to opt in and agree to the terms of service on the site and that a large number of bots could be disruptive to eBay’s computer systems.

ProxyBoys