Are Web Crawlers Legal
Is web crawling legal? – Towards Data Science
Web crawling, also known as web scraping, data scraping or spidering, is a technique in which a computer program collects large amounts of data from websites, extracting regularly formatted data and processing it into easy-to-read structured formats.
Web crawling is basically how the internet functions. For example, SEO requires creating sitemaps and giving Google permission to crawl a site so that it ranks higher in the search results. Many consulting companies hire firms that specialize in web scraping to enrich their databases so that they can provide professional services to their clients.
It is really hard to determine the legality of web scraping in the digital era.
Web crawling can be used for malicious purposes, for example:
Scraping private or classified information.
Disregarding the website’s Terms of Service and scraping without the owner’s permission.
Sending data requests in an abusive manner, which can crash a web server under the additional heavy load.
It is important to note that a responsible data service provider would refuse your request if:
The data is private and requires a username and passcode.
The TOS (Terms of Service) explicitly prohibits web scraping.
The data is copyrighted.
Scraping can also lead to legal claims such as:
Violation of the Computer Fraud and Abuse Act (CFAA).
Violation of the Digital Millennium Copyright Act (DMCA).
Trespass to chattels.
Having “just scraped a website” may cause unexpected consequences if you used the data improperly.
You have probably heard of the hiQ v. LinkedIn case in 2017. hiQ is a data science company that provides scraped data to corporate HR departments. LinkedIn sent a cease-and-desist letter to stop hiQ’s scraping. hiQ then filed a lawsuit to stop LinkedIn from blocking its access. The court ruled in favor of hiQ, because hiQ scraped data only from public LinkedIn profiles, without logging in. In that ruling, scraping data that is publicly shared on the internet was treated as lawful.
Let’s take another example to illustrate in what case web scraping can be harmful.
The law case eBay v. Bidder’s Edge.
If you’re doing web crawling for your own purposes, it is usually considered legal, as it tends to fall under the fair use doctrine. The complications start if you want to use scraped data for others, especially for commercial purposes.
Quoted: eBay v. Bidder’s Edge, 100 F. Supp. 2d 1058 (N.D. Cal. 2000), was a leading case applying the trespass to chattels doctrine to online activities. In 2000, eBay, an online auction company, successfully used the ‘trespass to chattels’ theory to obtain a preliminary injunction preventing Bidder’s Edge, an auction data aggregator, from using a ‘crawler’ to gather data from eBay’s website. The opinion was a leading case applying ‘trespass to chattels’ to online activities, although its analysis has been criticized in more recent jurisprudence.
As long as you are not crawling at a disruptive rate and the source is public, you should be fine. I suggest you check the websites you plan to crawl for any Terms of Service clauses related to scraping their intellectual property. If it says “no scraping or crawling”, you should respect it.
Suggestions:
Scrape discreetly; check “robots.txt” before you start scraping.
Go conservative. Aggressively asking for data can burden the web server. An ethical way is to be gentle. No one wants to crash the server.
Use the data wisely. Don’t duplicate the data. You can generate insight from collected data, and help your business grow.
Reach out to the owner of the website before you start scraping.
Don’t randomly pass scraped data to anyone. If it is valuable data, keep it secure.
Is Web Scraping Illegal? Depends on What the Meaning of the …
Depending on who you ask, web scraping can be loved or hated.
Web scraping has existed for a long time and, in its good form, it’s a key underpinning of the internet. “Good bots” enable, for example, search engines to index web content, price comparison services to save consumers money, and market researchers to gauge sentiment on social media.
“Bad bots,” however, fetch content from a website with the intent of using it for purposes outside the site owner’s control. Bad bots make up 20 percent of all web traffic and are used to conduct a variety of harmful activities, such as denial of service attacks, competitive data mining, online fraud, account hijacking, data theft, stealing of intellectual property, unauthorized vulnerability scans, spam and digital ad fraud.
So, is it Illegal to Scrape a Website?
So is it legal or illegal? Web scraping and crawling aren’t illegal by themselves. After all, you could scrape or crawl your own website, without a hitch.
Startups love it because it’s a cheap and powerful way to gather data without the need for partnerships. Big companies use web scrapers for their own gain but also don’t want others to use bots against them.
General opinion on the matter no longer seems to count for much, because over the past 12 months it has become very clear that the federal court system is cracking down more than ever.
Let’s take a look back. Web scraping started in a legal grey area where the use of bots to scrape a website was simply a nuisance. Not much could be done about the practice until 2000, when eBay sought a preliminary injunction against Bidder’s Edge. eBay claimed that the use of bots on the site, against the will of the company, violated trespass to chattels law.
The court granted the injunction because users had to opt in and agree to the terms of service on the site and that a large number of bots could be disruptive to eBay’s computer systems. The lawsuit was settled out of court so it all never came to a head but the legal precedent was set.
In 2001 however, a travel agency sued a competitor who had “scraped” its prices from its Web site to help the rival set its own prices. The judge ruled that the fact that this scraping was not welcomed by the site’s owner was not sufficient to make it “unauthorized access” for the purpose of federal hacking laws.
Two years later the legal standing of eBay v. Bidder’s Edge was implicitly overruled in Intel v. Hamidi, a case interpreting California’s common law trespass to chattels. It was the wild west once again. Over the next several years the courts ruled time and time again that simply putting “do not scrape us” in your website terms of service was not enough to create a legally binding agreement. To enforce that term, a user must explicitly agree or consent to the terms. This left the field wide open for scrapers to do as they wished.
Fast forward a few years and you start seeing a shift in opinion. In 2009 Facebook won one of the first copyright suits against a web scraper. This laid the groundwork for numerous lawsuits that tie web scraping to a direct copyright violation and very clear monetary damages. The most recent case is AP v. Meltwater, where the court stripped away what is referred to as fair use on the internet.
Previously, people could rely on fair use to run web scrapers for academic, personal, or information aggregation purposes. The court gutted the fair use defense that companies had used to defend web scraping. It determined that even small percentages, sometimes as little as 4.5% of the content, are significant enough to not fall under fair use. The only caveat the court made was based on the simple fact that this data was available for purchase. Had it not been, it is unclear how the court would have ruled. Then a few months back the gauntlet was thrown down.
Andrew Auernheimer was convicted of hacking based on the act of web scraping. Although the data was unprotected and publicly available via AT&T’s website, the fact that he wrote web scrapers to harvest that data en masse amounted to a “brute force attack”. He did not have to consent to terms of service to deploy his bots and conduct the web scraping. The data was not available for purchase. It wasn’t behind a login. He did not even gain financially from the aggregation of the data. Most importantly, it was buggy programming by AT&T that exposed this information in the first place. Yet Andrew was at fault. This isn’t just a civil suit anymore. This charge is a felony violation that is on par with hacking or denial of service attacks and carries up to a 15-year sentence for each charge.
In 2016, Congress passed its first legislation specifically to target bad bots — the Better Online Ticket Sales (BOTS) Act, which bans the use of software that circumvents security measures on ticket seller websites. Automated ticket scalping bots use several techniques to do their dirty work including web scraping that incorporates advanced business logic to identify scalping opportunities, input purchase details into shopping carts, and even resell inventory on secondary markets.
To counteract this type of activity, the BOTS Act:
Prohibits the circumvention of a security measure used to enforce ticket purchasing limits for an event with an attendance capacity of greater than 200 persons.
Prohibits the sale of an event ticket obtained through such a circumvention violation if the seller participated in, had the ability to control, or should have known about it.
Treats violations as unfair or deceptive acts under the Federal Trade Commission Act. The bill provides authority to the FTC and states to enforce against such violations.
In other words, if you’re a venue, organization or ticketing software platform, it is still on you to defend against this fraudulent activity during your major onsales.
The UK seems to have followed the US with its Digital Economy Act 2017, which achieved Royal Assent in April. The Act seeks to protect consumers in a number of ways in an increasingly digital society, including by “cracking down on ticket touts by making it a criminal offence for those that misuse bot technology to sweep up tickets and sell them at inflated prices in the secondary market.”
In the summer of 2017, hiQ Labs, a San Francisco-based startup, sued LinkedIn after LinkedIn moved to block its scraping. hiQ was scraping publicly available LinkedIn profiles to offer clients, according to its website, “a crystal ball that helps you determine skills gaps or turnover risks months ahead of time.”
You might find it unsettling to think that your public LinkedIn profile could be used against you by your employer.
Yet a judge on Aug. 14, 2017 decided this is okay. Judge Edward Chen of the U.S. District Court in San Francisco agreed with hiQ’s claim in a lawsuit that Microsoft-owned LinkedIn violated antitrust laws when it blocked the startup from accessing such data. He ordered LinkedIn to remove the barriers within 24 hours. LinkedIn has filed to appeal.
The ruling contradicts previous decisions clamping down on web scraping. And it opens a Pandora’s box of questions about social media user privacy and the right of businesses to protect themselves from data hijacking.
There’s also the matter of fairness. LinkedIn spent years creating something of real value. Why should it have to hand it over to the likes of hiQ — paying for the servers and bandwidth to host all that bot traffic on top of their own human users, just so hiQ can ride LinkedIn’s coattails?
I am in the business of blocking bots. Chen’s ruling has sent a chill through those of us in the cybersecurity industry devoted to fighting web-scraping bots.
I think there is a legitimate need for some companies to be able to prevent unwanted web scrapers from accessing their site.
In October of 2017, and as reported by Bloomberg, Ticketmaster sued Prestige Entertainment, claiming it used computer programs to illegally buy as many as 40 percent of the available seats for performances of “Hamilton” in New York and the majority of the tickets Ticketmaster had available for the Mayweather v. Pacquiao fight in Las Vegas two years ago.
Prestige continued to use the illegal bots even after it paid $3.35 million to settle New York Attorney General Eric Schneiderman’s probe into the ticket resale industry.
Under that deal, Prestige promised to abstain from using bots, Ticketmaster said in the complaint. Ticketmaster asked for unspecified compensatory and punitive damages and a court order to stop Prestige from using bots.
Are the existing laws too antiquated to deal with the problem? Should new legislation be introduced to provide more clarity? Most sites don’t have any web scraping protections in place. Do the companies have some burden to prevent web scraping?
As the courts try to further decide the legality of scraping, companies are still having their data stolen and the business logic of their websites abused. Instead of looking to the law to eventually solve this technology problem, it’s time to start solving it with anti-bot and anti-scraping technology today.
Web Scraping and Crawling Are Perfectly Legal, Right?
“Come on, I worked so hard on this project! And this is publicly accessible data! There’s certainly a way around this, right? Or else, I did all of this for nothing… Sigh…”
Yep – this is what I said to myself, just after realizing that my ambitious data analysis project could get me into hot water. I intended to deploy a large-scale web crawler to collect data from multiple high profile websites. And then I was planning to publish the results of my analysis for the benefit of everybody. Pretty noble, right? Yes, but also pretty risky.
Interestingly, I’ve been seeing more and more projects like mine lately. And even more tutorials encouraging some form of web scraping or crawling. But what troubles me is the appallingly widespread ignorance of the legal aspects of it.
So this is what this post is all about – understanding the possible consequences of web scraping and crawling. Hopefully, this will help you to avoid any potential problem.
Disclaimer: I’m not a lawyer. I’m simply a programmer who happens to be interested in this topic. You should seek out appropriate professional advice regarding your specific situation.
What are web scraping and crawling?
Let’s first define these terms to make sure that we’re on the same page.
Web scraping: the act of automatically downloading a web page’s data and extracting very specific information from it. The extracted information can be stored pretty much anywhere (database, file, etc.).
Web crawling: the act of automatically downloading a web page’s data, extracting the hyperlinks it contains and following them. The downloaded data is generally stored in an index or a database to make it easily searchable.
For example, you may use a web scraper to extract weather forecast data from the National Weather Service. This would allow you to further analyze it.
In contrast, you may use a web crawler to download data from a broad range of websites and build a search engine. Maybe you’ve already heard of Googlebot, Google’s own web crawler.
So web scrapers and crawlers are generally used for entirely different purposes.
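To make the two definitions concrete, here is a minimal sketch that runs both steps on the same page. Everything in it is an assumption made for illustration: the page URL, the HTML snippet, and the `forecast` class name are made up, and the page is a hard-coded string instead of a real download. The scraper pulls out one specific field, while the crawler step only collects the hyperlinks it would follow next.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ForecastScraper(HTMLParser):
    """Scraping step: extract one specific piece of information."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.forecast = None

    def handle_starttag(self, tag, attrs):
        # Hypothetical markup: the forecast lives in <span class="forecast">.
        if tag == "span" and dict(attrs).get("class") == "forecast":
            self._in_target = True

    def handle_data(self, data):
        if self._in_target and self.forecast is None:
            self.forecast = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_target = False


class LinkExtractor(HTMLParser):
    """Crawling step: collect the hyperlinks so they can be followed later."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


# Made-up page, used here instead of downloading anything.
PAGE_URL = "https://example.org/weather/today"
PAGE_HTML = """
<html><body>
  <span class="forecast">Sunny, 21 C</span>
  <a href="/weather/tomorrow">Tomorrow</a>
  <a href="https://example.org/about">About</a>
</body></html>
"""

scraper = ForecastScraper()
scraper.feed(PAGE_HTML)

extractor = LinkExtractor(PAGE_URL)
extractor.feed(PAGE_HTML)

print(scraper.forecast)   # the single scraped field
print(extractor.links)    # the URLs a crawler would visit next
```

A real crawler would repeat the link-extraction step on every page it downloads, which is where the crawl-rate and permission concerns discussed in this post come in.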
Why is web scraping often seen negatively?
The reputation of web scraping has gotten a lot worse in the past few years, and for good reasons:
It’s increasingly being used for business purposes to gain a competitive advantage. So there’s often a financial motive behind it.
It’s often done in complete disregard of copyright laws and of Terms of Service (ToS).
It’s often done in abusive manners. For example, web scrapers might send many more requests per second than a human would, thus causing an unexpected load on websites. They might also choose to stay anonymous and not identify themselves. Finally, they might also perform prohibited operations on websites, like circumventing the security measures that are put in place, in order to automatically download data that would otherwise be inaccessible.
Tons of individuals and companies are running their own web scrapers right now. So much so that this has been causing headaches for companies whose websites are scraped, like social networks (e.g. Facebook, LinkedIn, etc.) and online stores (e.g. Amazon). This is probably why Facebook has separate terms for automated data collection.
In contrast, web crawling has historically been used by the well-known search engines (e.g. Google, Bing, etc.) to download and index the web. These companies have built a good reputation over the years, because they’ve built indispensable tools that add value to the websites they crawl. So web crawling is generally seen more favorably, although it may sometimes be used in abusive ways as well.
So is it legal or illegal?
Web scraping and crawling aren’t illegal by themselves. After all, you could scrape or crawl your own website, without a hitch.
The problem arises when you scrape or crawl the website of somebody else, without obtaining their prior written permission, or in disregard of their Terms of Service (ToS). You’re essentially putting yourself in a vulnerable position.
Just think about it; you’re using the bandwidth of somebody else, and you’re freely retrieving and using their data. It’s reasonable to think that they might not like it, because what you’re doing might hurt them in some way. So depending on many factors (and what mood they’re in), they’re perfectly free to pursue legal action against you.
I know what you may be thinking. “Come on! This is ridiculous! Why would they sue me?”. Sure, they might just ignore you. Or they might simply use technical measures to block you. Or they might just send you a cease and desist letter. But technically, there’s nothing that prevents them from suing you. This is the real problem.
Need proof? In LinkedIn v. Doe Defendants, LinkedIn is suing between 1 and 100 people who anonymously scraped its website. And for what reasons is it suing those people? Let’s see:
Violation of the Computer Fraud and Abuse Act (CFAA).
Violation of California Penal Code.
Violation of the Digital Millennium Copyright Act (DMCA).
Breach of contract.
Trespass.
Misappropriation.
That lawsuit is pretty concerning, because it’s really not clear what will happen to those “anonymous” people.
Consider that if you ever get sued, you can’t simply dismiss it. You need to defend yourself, and prove that you did nothing wrong. This has nothing to do with whether or not it’s fair, or whether or not what you did is really illegal.
Another problem is that law isn’t like anything you’re probably used to. Because where you use logic, common sense and your technical expertise, they’ll use legal jargon and some grey areas of law to prove that you did something wrong. This isn’t a level playing field. And it certainly isn’t a good situation to be in. So you’ll need to get a lawyer, and this might cost you a lot of money.
Besides, based on the above lawsuit by LinkedIn, you can see that cases can undoubtedly become quite complex and very broad in scope, even though you “just scraped a website”.
The typical counterarguments brought by people
I found that people generally try to defend their web scraping or crawling activities by downplaying their importance. And they do so typically by using the same arguments over and over again.
So let’s review the most common ones:
“I can do whatever I want with publicly accessible data.”
False. The problem is that the “creative arrangement” of data can be copyrighted, as the following passage describes:
Facts cannot be copyrighted. However, the creative selection, coordination and arrangement of information and materials forming a database or compilation may be protected by copyright. Note, however, that the copyright protection only extends to the creative aspect, not to the facts contained in the database or compilation.
So a website – including its pages, design, layout and database – can be copyrighted, because it’s considered as a creative work. And if you scrape that website to extract data from it, the simple fact of copying a web page in memory with your web scraper might be considered as a copyright violation.
In the United States, copyrighted work is protected by the Digital Millennium Copyright Act (DMCA).
“This is fair use!”
This is a grey area:
In Kelly v. Arriba Soft Corp., the court found that the image search engine made fair use of a professional photographer’s pictures by displaying thumbnails of them.
In Associated Press v. Meltwater U.S. Holdings, Inc., the court found that Meltwater’s news aggregator service didn’t make fair use of Associated Press’ articles, even though scraped articles were only displayed as excerpts of the originals.
“It’s the same as what my browser already does! Scraping a site is not technically different from using a web browser. I could gather data manually, anyway!”
False. Terms of Service (ToS) often contain clauses that prohibit crawling/scraping/harvesting and automated uses of their associated services. You’re legally bound by those terms; it doesn’t matter that you could get that data manually.
“The worst that might happen if I break their Terms of Service is that I might get banned or blocked.”
In Facebook v. Pete Warden, Facebook’s attorney threatened to sue Mr. Warden if he published his dataset comprising hundreds of millions of scraped Facebook profiles.
In LinkedIn Corporation v. Michael George Keating, LinkedIn blocked Mr. Keating from accessing LinkedIn because he had created a tool that they thought was made to scrape their website. They were wrong. Yet he has never been able to restore his account. Fortunately, this case didn’t go further.
In LinkedIn Corporation v. Robocog Inc, Robocog Inc. (a.k.a. HiringSolved) was ordered to pay $40,000 to LinkedIn for its unauthorized scraping of the site.
“This is completely unfair! Google has been crawling/scraping the whole web since forever!”
True. But law has apparently nothing to do with fairness. It’s based on rules, interpreted by people.
“If I ever get sued, I’ll Good-Will-Hunting my way into defending myself.”
Good luck! Unless you know law and legal jargon extensively. Personally, I don’t.
“But I used an automated script, so I didn’t enter into any contract with the website.”
In Internet Archive v. Suzanne Shell, Internet Archive was found guilty of breach of contract while copying and archiving pages from Mrs. Shell’s website using its web crawlers. On her website, Mrs. Shell displays a warning stating that as soon as you copy content from her website, you enter into a contract, and you owe her US$5,000 per page copied (!!!). The two parties apparently reached an amicable resolution.
In Southwest Airlines Co. v. BoardFirst, LLC, BoardFirst was found guilty of violating a browsewrap contract displayed on Southwest Airlines’ website. BoardFirst had created a tool that automatically downloaded the boarding passes of Southwest’s customers to offer them better seats.
“Terms of Service (ToS) are not enforceable anyway. They have no legal value.”
The Bingham McCutchen LLP law firm published a pretty extensive article on this matter and they state that:
As is the general rule with any contract, a website’s terms of use will generally be deemed enforceable if mutually agreed to by the parties. […] Regardless of whether a website’s terms of use are clickwrap or browsewrap, the defendant’s failure to read those terms is generally found irrelevant to the enforceability of its terms. One court disregarded arguments that awareness of a website’s terms of use could not be imputed to a party who accessed that website using a web crawling or scraping tool that is unable to detect, let alone agree, to such terms. Similarly, one court imputed knowledge of a website’s terms of use to a defendant who had repeatedly accessed that website using such tools. Nevertheless, these cases are, again, intensely factually driven, and courts have also declined to enforce terms of use where a plaintiff has failed to sufficiently establish that the defendant knew or should have known of those terms (e.g., because the terms are inconspicuous), even where the defendant repeatedly accessed a website using web crawling and scraping tools.
In other words, Terms of Service (ToS) will be legally enforced depending on the court, and if there’s sufficient proof that you were aware of them.
“I respected their robots.txt and I crawled at a reasonable speed, so I can’t possibly get into trouble, right?”
This is a grey area.
robots.txt is recognized as a “technological tool to deter unwanted crawling or scraping”. But whether or not you respect it, you’re still bound to the Terms of Service (ToS).
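For readers who have not checked a robots.txt file programmatically, the sketch below shows one way to do it with Python’s standard `urllib.robotparser` module. The robots.txt content, the bot name and the URLs are all made up for illustration, and the file is parsed from a string instead of being downloaded. As the answer above notes, passing this check says nothing about the site’s Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "MY-BOT"  # hypothetical bot name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url):
    """Return True if the parsed robots.txt lets USER_AGENT fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://example.org/public/page.html"))
print(is_allowed("https://example.org/private/data.html"))
```

The check is purely advisory on the technical side: nothing stops a program from ignoring it, which is part of why respecting it does not settle the legal question.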
“Okay, but this is for personal use. For my personal research only. I won’t re-publish it, or publish any derivative dataset, or even sell it. So I’m good to go, right?”
This is a grey area. Terms of Service (ToS) often prohibit automatic data collection, for any purpose.
According to the Bingham McCutchen LLP law firm:
The terms of use for websites frequently include clauses prohibiting access or use of the website by web crawlers, scrapers or other robots, including for purposes of data collection. Courts have recognized causes of action for breaches of contract based on the use of web crawling or scraping tools in violation of such provisions.
“But the website has no robots.txt. So I can do what I want, right?”
False. You’re still bound to the Terms of Service (ToS), and the content is copyrighted.
General advice for your scraping or crawling projects
Based on the above, you can certainly guess that you should be extra cautious with web scraping and crawling.
Here are a few pieces of advice:
Use an API if one is provided, instead of scraping data.
Respect the Terms of Service (ToS).
Respect the rules of robots.txt.
Use a reasonable crawl rate, i.e. don’t bombard the site with requests. Respect the crawl-delay setting provided in robots.txt; if there’s none, use a conservative crawl rate (e.g. 1 request per 10-15 seconds).
Identify your web scraper or crawler with a legitimate user agent string. Create a page that explains what you’re doing and why, and link back to the page in your user agent string (e.g. ‘MY-BOT (+)’).
If ToS or robots.txt prevent you from crawling or scraping, ask the owner of the site for written permission before doing anything else.
Don’t republish your crawled or scraped data or any derivative dataset without verifying the license of the data, or without obtaining a written permission from the copyright holder.
If you have doubts about the legality of what you’re doing, don’t do it. Or seek the advice of a lawyer.
Don’t base your whole business on data scraping. The website(s) that you scrape may eventually block you, just like what happened in Craigslist Inc. v. 3Taps Inc.
Finally, you should be suspicious of any advice that you find on the internet (including mine), so please consult a lawyer.
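The crawl-rate and user-agent advice above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a complete crawler: the bot name, the info-page URL and the 10-second fallback are hypothetical values chosen to match the advice, and `choose_delay`, `Throttle`, `build_request` and `polite_fetch` are names invented for this example.

```python
import time
import urllib.request

# Hypothetical bot name and info page; replace both with your own.
USER_AGENT = "MY-BOT (+https://example.org/my-bot-info)"
DEFAULT_DELAY_SECONDS = 10.0  # conservative fallback when the site sets no crawl-delay


def choose_delay(robots_crawl_delay):
    """Use the site's crawl-delay when it gives one, else the fallback."""
    if robots_crawl_delay is None:
        return DEFAULT_DELAY_SECONDS
    return float(robots_crawl_delay)


class Throttle:
    """Keep consecutive requests at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()


def build_request(url):
    """Attach the identifying User-Agent header to a request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_fetch(urls, throttle):
    """Fetch URLs one at a time, pausing between requests."""
    pages = {}
    for url in urls:
        throttle.wait()
        with urllib.request.urlopen(build_request(url), timeout=30) as response:
            pages[url] = response.read()
    return pages
```

A typical call would be `polite_fetch(urls, Throttle(choose_delay(None)))`, which waits at least ten seconds between requests. The clock and sleep functions are injectable only so that the pacing logic can be tested without real waiting.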
Remember that companies and individuals are perfectly free to sue you, for whatever reasons they want. This is most likely not the first step that they’ll take. But if you scrape/crawl their website without permission and you do something that they don’t like, you definitely put yourself in a vulnerable position.
Conclusion
As we’ve seen in this post, web scraping and crawling aren’t illegal by themselves. They might become problematic when you play on somebody else’s turf, on your own terms, without obtaining their prior permission. The same is true in real life as well, when you think about it.
There are a lot of grey areas in law around this topic, so the outcome is pretty unpredictable. Before getting into trouble, make sure that what you’re doing respects the rules.
And finally, the relevant question isn’t “Is this legal?”. Instead, you should ask yourself “Am I doing something that might upset someone? And am I willing to take the (financial) risk of their response?”.
So I hope that you appreciated my post! Feel free to leave a comment in the comment section below!
This post was featured on Hacker News, Reddit, Lobsters and in the Programming Digest newsletter. Thanks to everyone for your support and feedback!
Frequently Asked Questions: Are Web Crawlers Legal?
Is crawling a website illegal?
Web scraping and crawling aren’t illegal by themselves. … Web scraping started in a legal grey area where the use of bots to scrape a website was simply a nuisance. Not much could be done about the practice until in 2000 eBay filed a preliminary injunction against Bidder’s Edge.
Do you need permission for web scraping?
Web scraping and crawling aren’t illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. The problem arises when you scrape or crawl the website of somebody else, without obtaining their prior written permission, or in disregard of their Terms of Service (ToS). (Apr 18, 2017)
Is crawling on Youtube legal?
Violation of ToS by itself is not (or rather should not be) illegal, but it is a contract violation; and you might be doing things that are also criminal, depending on how exactly you perform the scraping (e.g. computer fraud for bypassing digital security). (Jun 2, 2015)