VPN Web Scraping
How to webscrape with VPN in Python? – Stack Overflow
I have made a Python program that scrapes IMDb with BeautifulSoup to build a MySQL database with tables of all the top-rated movies in the different categories. So far so good. My problem is that I am doing this from Norway, and many of the movie titles are translated to Norwegian. For example, in the IMDb top list opened from a Norwegian IP address, “The Shawshank Redemption” is translated to “Frihetens Regn”. I want all the titles in English. Are there maybe some free VPNs that you can activate from Python and that work with BeautifulSoup? Or does anyone have another solution to this?
asked Dec 28 ’19 at 14:46
You have a couple of options: a VPN or a proxy.
First, yes, you can use a VPN. However, most VPNs require the entire host connection to tunnel through the VPN. There are a few good VPN services out there, but you get what you pay for. I would be cautious about free VPNs, because some sell access to your network connection and others sell your data.
Second, and this might be the easiest option, you can use proxies. You can tell your scraper to route its traffic through a free anonymous proxy. You can find lists of free proxies with a Google search, or you can check out ProxyBroker, which finds free proxies for you. This only requires proxying the scraper’s traffic through a US IP address, not your entire host connection.
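Routing only the scraper's traffic through a proxy can be sketched with Python's standard library. This is a minimal sketch, not a definitive setup: the proxy address below is a placeholder from a documentation-only IP range, and the function names `make_proxy_opener` and `fetch_html` are hypothetical helpers introduced for illustration.

```python
import urllib.request


def make_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_html(opener: urllib.request.OpenerDirector, url: str,
               timeout: float = 10.0) -> str:
    """Fetch a page through the given opener and return it as text."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder proxy address (assumption): replace it with a proxy you may use.
opener = make_proxy_opener("http://203.0.113.10:8080")
# html = fetch_html(opener, "https://www.imdb.com/chart/top/")  # needs a live proxy
```

The returned HTML string can be handed to BeautifulSoup as usual. Only requests made through this opener use the proxy; the rest of the host's traffic is unaffected.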
answered Dec 28 ’19 at 15:22
I agree that using proxies will work better than using a VPN.
However, don’t go with a free proxy if you want results. If it’s something you can invest in, get a decent paid provider; otherwise most likely nothing good will come of this, as you will constantly get blocked.
answered Jan 6 ’20 at 13:51
Can page scraping be detected? – Stack Overflow
So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering whether someone would be able to figure out that the page was being scraped, whether or not they had written code for that purpose.
I wrote the code in Java, and it’s pretty much just checking for one line of the HTML code.
I thought I’d get some insight on that before I add any more code to this program. I mean, it’s useful and all, but it’s almost like a hack.
The worst-case scenario as a result of this page scraper doesn’t seem too bad, as I can just use another device later and the IP will be different. Also, it might not matter in a month. The website seems to be getting quite a lot of web traffic at the moment anyway. Whoever edits the page is probably asleep now, and the scraper really hasn’t accomplished anything at this point, so this could go unnoticed.
Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven’t announced it, so all is good.
asked Aug 4 ’11 at 5:09
Slayer0248
It is a hack. 🙂
There’s no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it’s quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. Same if you see requests on another timer.
You should try to obey the robots.txt file if you can, and rate-limit yourself, to be polite.
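Checking a site's robots.txt rules before fetching can be sketched with Python's standard `urllib.robotparser` module. This is a minimal sketch: the robots.txt content, the bot name `my-scraper`, and the helper name `load_rules` are made-up examples, and a real scraper would download the file from the target site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration (assumption, not from a real site).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule object without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS_TXT)
allowed_public = rules.can_fetch("my-scraper", "https://example.com/public/page")
allowed_private = rules.can_fetch("my-scraper", "https://example.com/private/page")
delay_seconds = rules.crawl_delay("my-scraper")  # None if the site sets no delay
```

If `can_fetch` returns `False`, skip the URL; if `crawl_delay` returns a number, sleeping at least that many seconds between requests is one simple way to rate-limit.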
answered Aug 4 ’11 at 5:12
Daniel Lyons
As a sysadmin myself, yes, I’d probably notice, but ONLY based on the behavior of the client. If a client had a weird user agent, I’d be suspicious. If a client browsed the site too quickly or at very predictable intervals, I’d be suspicious. If certain support files were never requested (such as the various linked CSS and JS files), I’d be suspicious. If the client were accessing odd (not directly accessible) pages, I’d be suspicious.
Then again, I’d have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no, I probably wouldn’t notice.
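The "very predictable intervals" signal can be sketched as a small check on request timestamps taken from a server log. This is an illustrative heuristic only: the function name `looks_automated` and both thresholds are assumptions chosen for the example, not values from any real detection tool.

```python
from statistics import pstdev
from typing import Sequence


def looks_automated(timestamps: Sequence[float],
                    min_requests: int = 5,
                    max_stdev_seconds: float = 1.0) -> bool:
    """Flag a client whose gaps between requests are almost perfectly regular.

    timestamps: request times for ONE client, in seconds.
    """
    if len(timestamps) < min_requests:
        return False  # too little data to judge
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # A timer-driven bot produces nearly identical gaps (low spread).
    return pstdev(gaps) <= max_stdev_seconds


timer_bot = [0, 10, 20, 30, 40, 50]   # one request every 10 seconds
human_like = [0, 3, 20, 22, 61, 90]   # irregular browsing
```

Here `looks_automated(timer_bot)` is true and `looks_automated(human_like)` is false. A scraper that adds random delays would slip past this particular check, which is why it is only one signal among several.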
answered Aug 4 ’11 at 5:20
Chris Eberle
It depends on how you have implemented this and how smart the detection tools are.
First, take care of the User-Agent. If you do not set it explicitly, it will be something like “Java-1.6”. Browsers send their own “unique” user agents, so you can just mimic browser behavior and send the User-Agent of MSIE or Firefox (for example).
Second, check the other HTTP headers. Some browsers probably send their own specific headers. Take one example and follow it, i.e. try to add the headers to your requests (even if you do not need them).
A human user acts relatively slowly. A robot may act very quickly, i.e. retrieve the page and then “click” a link, i.e. perform yet another HTTP GET. Put a random sleep between these operations.
A browser retrieves not only the main HTML; it then downloads images and other resources. If you really do not want to be detected, you have to parse the HTML and download these resources too, i.e. actually be a “browser”.
And the last point: it is obviously not your case, but it is almost impossible to implement a robot that passes a CAPTCHA. This is yet another way to detect a robot.
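The User-Agent and header advice above could look like the following in Python (the question used Java, but the idea is the same). This is a sketch under assumptions: the header values are illustrative examples of what a Firefox-style browser might send, and `build_browser_like_request` is a hypothetical helper name.

```python
import urllib.request
from typing import Dict, Optional

# Illustrative browser-style headers (assumption): copy real values from
# your own browser's developer tools rather than relying on these.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
                  "Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_browser_like_request(url: str,
                               extra_headers: Optional[Dict[str, str]] = None
                               ) -> urllib.request.Request:
    """Create a request that carries browser-style headers instead of defaults."""
    headers = dict(BROWSER_HEADERS)
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)


request = build_browser_like_request("https://example.com/")
# urllib.request.urlopen(request) would send it; omitted to avoid network access.
```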
Happy hacking!
answered Aug 4 ’11 at 5:24
AlexR
If your scraper acts like a human, there is hardly any chance of it being detected as a scraper. But if your scraper acts like a robot, it’s not difficult to detect.
To act like a human you will need to:
Look at what a browser sends in the HTTP headers and simulate them.
Look at what a browser requests when accessing the page and request the same resources with the scraper.
Time your scraper to access pages at the speed of a normal user.
Send requests at random intervals of time instead of at fixed intervals.
If possible, make requests from a dynamic IP rather than a static one.
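The timing points in the list above (human-like speed, random rather than fixed intervals) can be sketched as a small fetch loop. This is a minimal sketch: `polite_fetch_all` and its default delay values are assumptions made for illustration, and the `fetch` argument stands in for whatever function actually downloads a page.

```python
import random
import time
from typing import Callable, Iterable, List


def polite_fetch_all(urls: Iterable[str],
                     fetch: Callable[[str], str],
                     base_seconds: float = 2.0,
                     jitter_seconds: float = 3.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch each URL in turn, pausing a random amount between requests.

    Each pause lasts between base_seconds and base_seconds + jitter_seconds,
    so requests never arrive on a fixed timer.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(base_seconds + random.uniform(0, jitter_seconds))
        results.append(fetch(url))
    return results
```

The `sleep` parameter exists so the loop can be exercised without real waiting, for example by passing a function that only records the requested delays.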
answered Aug 4 ’11 at 5:25
manubkk
Assuming you wrote the page scraper in a normal manner, i.e., it fetches the whole page and then does pattern recognition to extract what you want from the page, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the entire page was fetched; they can’t tell what you do with it once it’s in your RAM.
answered Aug 4 ’11 at 5:13
jcomeau_ictx
To the server serving the page, there’s no difference between downloading a page into the browser and downloading a page to screen scrape it. Both actions just require an HTTP request; whatever you do with the resulting HTML on your end is none of the server’s business.
Having said that, a sophisticated server could conceivably detect activity that doesn’t look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn’t happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
Whether a server tries to detect these things depends on the server; most don’t.
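The "additional resources" signal described above can be sketched as a pass over simplified access-log entries. This is an illustrative heuristic under assumptions: the `(client_ip, path)` log format, the suffix list, the threshold, and the name `clients_without_assets` are all made up for the example.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# File suffixes treated as page assets in this sketch (assumption).
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def clients_without_assets(log_entries: Iterable[Tuple[str, str]],
                           min_pages: int = 3) -> List[str]:
    """Return clients that fetched several pages but never any linked asset."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for client_ip, path in log_entries:
        clean_path = path.split("?", 1)[0].lower()  # ignore query strings
        if clean_path.endswith(ASSET_SUFFIXES):
            assets[client_ip] += 1
        else:
            pages[client_ip] += 1
    return sorted(ip for ip, count in pages.items()
                  if count >= min_pages and assets.get(ip, 0) == 0)


example_log = [
    ("198.51.100.7", "/"), ("198.51.100.7", "/style.css"),
    ("198.51.100.7", "/about"), ("198.51.100.7", "/contact"),
    ("203.0.113.5", "/"), ("203.0.113.5", "/about"), ("203.0.113.5", "/contact"),
]
suspects = clients_without_assets(example_log)
```

In this example only `203.0.113.5` is flagged. Browser caching means a real visitor may also skip assets on repeat visits, so this is a weak signal that would need to be combined with others.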
answered Aug 4 ’11 at 5:15
deceze
I’d like to put my two cents in for others that may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I’ve cited a lot of examples in a blog post I recently wrote.
You should definitely abide by the robots.txt file, but also look at the website’s T&Cs to make sure you are not in violation. There are definitely ways that people can identify that you are web scraping, and there could be consequences for doing so. If web scraping is not disallowed by the website’s Terms and Conditions, then have fun, but make sure to still be conscientious. Don’t destroy a web server with an out-of-control bot; throttle yourself to make sure you don’t impact the server!
For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.
answered Oct 21 ’13 at 16:13
Rami
Frequently Asked Questions about VPN web scraping
Can I use a VPN for web scraping?
First, yes, you can use a VPN. However, most VPNs require the entire host connection to tunnel through the VPN. There are a few good VPN services out there, but you get what you pay for. I would be cautious about free VPNs, because some sell access to your network connection and others sell your data. (Dec 28, 2019)
Is web scraping detectable?
There’s no way to programmatically determine if a page is being scraped. But, if your scraper becomes popular or you use it too heavily, it’s quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. (Aug 4, 2011)
Is scraping with proxies legal?
So is web scraping legal or not? It is not illegal as such. After all, you can crawl and scrape your own website without much effort. Businesses use bots for their own benefit but at the same time don’t want others to use web scrapers against them. (Jul 6, 2021)