Amazon Product Scraper Python
Scraping Amazon Product Information using Beautiful Soup
Web scraping is a data extraction method used to gather data from websites. It is widely used for data mining and for collecting valuable insights from large websites, and it comes in handy for personal use as well. Python has an excellent library called BeautifulSoup for web scraping. We will use it to scrape product information from Amazon and save the details to a CSV file. For this article, the following prerequisites are needed.

A text file with a few URLs of Amazon product pages to scrape
Element IDs: we need the IDs of the elements we wish to scrape; we will cover how to find them soon

Here is how our text file looks.

Modules needed and installation:

BeautifulSoup: our primary module; it parses the HTML of a webpage so we can pull data out of it. Install it with pip install bs4
lxml: a helper library to process webpages in Python. Install it with pip install lxml
requests: makes sending HTTP requests straightforward. Install it with pip install requests

Approach: First, we import our required libraries. Then we take a URL stored in our text file and feed it to our soup object, which extracts the relevant information based on the element ID we provide and saves it to our CSV file. Let's have a look at the code and see what happens at each significant step.

Step 1: Initializing our program. We import BeautifulSoup and requests, and create/open a CSV file to save our gathered data. We also declare HEADERS with a user agent; this ensures the target website does not treat traffic from our program as spam and block it. There are plenty of user agents available online.

from bs4 import BeautifulSoup
import requests

File = open("out.csv", "a")  # output CSV file (use any name you like)
HEADERS = ({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})

webpage = requests.get(URL, headers=HEADERS)  # URL comes from the text file (see Step 5)
soup = BeautifulSoup(webpage.content, "lxml")

Step 2: Retrieving element IDs. We identify elements by looking at the rendered web page, but our script cannot do the same. To pinpoint our target element, we grab its element ID and feed it to the script. Getting the ID of an element is pretty simple. Suppose I need the element ID of the product's name; all I have to do is go to the URL and inspect the text. In the developer console, we grab the text next to id= and copy the element ID. We feed it to soup.find() and convert the function's output into a string. We remove commas from the string so that they do not interfere with the comma-separated format we are writing, and wrap everything in try-except so a missing element does not crash the script.

try:
    title = soup.find("span", attrs={"id": 'productTitle'})
    title_value = title.string
    title_string = title_value.strip().replace(',', '')
except AttributeError:
    title_string = "NA"
print("product Title = ", title_string)

Step 3: Saving the current information to a text file. We use our file object to write the string we just captured and end it with a comma ", " so it is read as a separate column when interpreted as CSV.

File.write(f"{title_string}, ")

We repeat the two steps above for every attribute we wish to capture from the page, such as item price and availability.

Step 4: Closing the file.

File.write(f"{available}, \n")
File.close()

While writing the last piece of information, notice how we add "\n" to move to the next line. Not doing so would give us all the required information in one very long row. We close the file using File.close().
This is necessary; if we do not do this, we might get an error the next time we open the file.

Step 5: Calling the function we just created.

if __name__ == '__main__':
    # text file with the product URLs, one per line (use your own file name)
    file = open("url.txt", "r")
    for links in file.readlines():
        main(links)

We open the URL file in reading mode and iterate over each of its lines until we reach the last one, calling the main function on each URL. Here is how our entire code looks:

from bs4 import BeautifulSoup
import requests


def main(URL):
    # output CSV file (use any name you like)
    File = open("out.csv", "a")
    HEADERS = ({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
                'Accept-Language': 'en-US, en;q=0.5'})
    webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(webpage.content, "lxml")
    try:
        title = soup.find("span", attrs={"id": 'productTitle'})
        title_value = title.string
        title_string = title_value.strip().replace(',', '')
    except AttributeError:
        title_string = "NA"
    print("product Title = ", title_string)
    File.write(f"{title_string}, ")
    try:
        price = soup.find("span", attrs={'id': 'priceblock_ourprice'}).string.strip().replace(',', '')
    except AttributeError:
        price = "NA"
    print("Products price = ", price)
    File.write(f"{price}, ")
    try:
        rating = soup.find("i", attrs={'class': 'a-icon a-icon-star a-star-4-5'}).string.strip().replace(',', '')
    except AttributeError:
        try:
            rating = soup.find("span", attrs={'class': 'a-icon-alt'}).string.strip().replace(',', '')
        except:
            rating = "NA"
    print("Overall rating = ", rating)
    File.write(f"{rating}, ")
    try:
        review_count = soup.find("span", attrs={'id': 'acrCustomerReviewText'}).string.strip().replace(',', '')
    except AttributeError:
        review_count = "NA"
    print("Total reviews = ", review_count)
    File.write(f"{review_count}, ")
    try:
        available = soup.find("div", attrs={'id': 'availability'})
        available = available.find("span").string.strip().replace(',', '')
    except AttributeError:
        available = "NA"
    print("Availability = ", available)
    File.write(f"{available}, \n")
    File.close()


if __name__ == '__main__':
    file = open("url.txt", "r")
    for links in file.readlines():
        main(links)

Output:

product Title = Dremel DigiLab 3D40 Flex 3D Printer w/Extra Supplies 30 Lesson Plans Professional Development Course Flexible Build Plate Automated 9-Point Leveling PC & MAC OS Chromebook iPad Compatible
Products price = $1699.00
Overall rating = 4.1 out of 5 stars
Total reviews = 40 ratings
Availability = In Stock.

product Title = Comgrow Creality Ender 3 Pro 3D Printer with Removable Build Surface Plate and UL Certified Power Supply 220x220x250mm
Products price = NA
Overall rating = 4.6 out of 5 stars
Total reviews = 2509 ratings
Availability = NA

product Title = Dremel Digilab 3D20 3D Printer Idea Builder for Brand New Hobbyists and Tinkerers
Products price = $679. 5 out of 5 stars
Total reviews = 584 ratings
Availability = In Stock.

product Title = Dremel DigiLab 3D45 Award Winning 3D Printer w/Filament PC & MAC OS Chromebook iPad Compatible Network-Friendly Built-in HD Camera Heated Build Plate Nylon ECO ABS PETG PLA Print Capability
Products price = $1710.81
Overall rating = 4.5 out of 5 stars
Total reviews = 351 ratings
Availability = In Stock.

Here is how our CSV file looks.
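The tutorial above strips commas out of every value and writes the separators by hand. As a brief sketch of an alternative, Python's built-in csv module quotes fields that contain commas for you; the file name and the sample row below are illustrative only, taken from the output shown above rather than from the original code.

import csv

def append_row(path, row):
    # csv.writer quotes fields containing commas, so scraped strings
    # do not need to have their commas stripped first.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)

append_row("out.csv",
           ["Dremel DigiLab 3D40 Flex 3D Printer", "$1699.00",
            "4.1 out of 5 stars", "40 ratings", "In Stock."])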
How to Scrape Amazon Prices With Python – Towards Data …
With only a few lines of Python, you can build your own web scraping tool to monitor multiple stores so you never miss a great deal! Stay alert to those Amazon deals! (Photo by Gary Bendig.)

I think it's fair to assume that at one point we have all had a bookmarked product page from Amazon which we refreshed frantically, hoping for the price to go down. OK, maybe not frantically, but definitely several times a day. I will show you how to write a simple Python script that can scrape Amazon product pages from any of their stores and check the price, among other things. I make an effort to keep it simple, and if you already know the basic stuff you should be able to follow along smoothly. Here is what we will do in this project:

Create a csv file with the links for the products we want, as well as the price we are willing to pay for them
Write a function with Beautiful Soup that will cycle through the links in the csv file and retrieve information about them
Store everything in a "database" and keep track of the product price over time so we can see the historical trend
Schedule the script to run at specific times during the day
Extra: create an email alert for when a price is lower than your limit

I am also working on a script that takes search terms instead of the product links and returns the most relevant products from all stores. If you feel like that could be useful, let me know in the comments and I'll write another article about it! Update: I just created a video with the whole process! Check it out and let me know how much you saved. :)

The saddest thing about expanding your hobbies into something more professional is usually the need to purchase better equipment. Consider photography, for example. Purchasing a new camera is a decision that can impact your sleeping time dramatically. Buying a new lens is known to have a similar effect, at least in my experience! Not only are the technical aspects very relevant and in need of some research, but you also want, or hope, to get the best possible price. Once the research is done, it is time to search for the chosen model in every online store we know. Multiply the number of stores you know by the number of selected models and you get an approximate number of tabs open on my browser. Most of the time I end up visiting Amazon again. In my case, since I am in Europe, it usually involves searching Amazon stores from countries like Italy, France, the UK, Spain, and Germany. I have found very different prices for the same products, but more importantly, there are occasional deals specific to each market. And we don't want to miss out on that.

This is where knowing a bit of Python can save you money. Literally! I started testing some web scraping to help me automate this task, and it turns out that the HTML structure of each store is pretty much the same, so we can use our script on all of them to quickly gather the prices from all the stores. I have written a few articles about web scraping before where I explain how Beautiful Soup works. This Python package is very easy to use, and you can check the article I wrote about scraping prices from house listings. In a nutshell, Beautiful Soup is a tool that you can use to access specific tags from an HTML page.
Even if you haven't heard about it before, I bet you can understand what it is doing once you see the code. You will need a file with the links for the products you want to stalk; a template is provided in this Github repository. After each run, the scraper saves the results in a different file called "search_history_[whatever date]". These are the files inside the search_history folder, and you can also find them in the repository. That is the folder and file structure we need. Since I do not want to overcomplicate things, I will use a Jupyter notebook to show you the code outputs first. But in the end, we will give this code some steroids and turn it into a nice function inside a Python file. To top it up, we'll create a scheduled task to run it from time to time!

Soup is all we need

Let's start with the notebook view. Use these next snippets in your Jupyter notebook so you can see what each part of the code does. The least known package here is probably glob. It's a nice package that allows you to get things like a list of filenames inside a folder. Requests will take care of fetching the URLs we set up in the tracker list, and Beautiful Soup is our web scraping tool for this challenge. If you need to install any of them, a simple pip/conda install will do. There are plenty of resources that can help you out, but usually the Python Package Index page will have what you need.

The HEADERS variable is needed to pass along with the get method. I also discuss it in my other articles mentioned above, but you can think of it as your ID card when visiting URLs with the requests package: it tells the server what kind of browser you are using. You can read more about it here. This might be a good time to get that tracker csv file from the repository and place it in the folder "trackers". Then we run the downloaded page through Beautiful Soup. This will tame the HTML into something more "accessible", (cleverly) named soup. A Prince throwback to keep you entertained!

Which ingredients should we pick? We start with simple things like the title of the product. Although this is almost redundant, it's a nice thing to have some detail about the product once we check the Excel file. I added the other fields like the review score/count and the availability, not because they are relevant to the purchase decision, but because I am a bit OCD with these things and I prefer to have variables that "can be useful someday". It could be interesting to see the evolution of the review score for a particular product. Maybe not, but we'll keep it anyway! Whenever you see find(), it means we are trying to find an element of the page using its HTML tag (like div or span) and/or attributes (name, id, class, etc.). With select(), we are using CSS selectors. You can use the Inspect feature on your browser and navigate the page code, but I recently came across a very handy Chrome extension called SelectorGadget, and I highly recommend it. It makes finding the right selectors way easier. At this point, if you find it difficult to follow how the selectors work, I encourage you to read the article where I explain them in a bit more detail.

Below you can see what each part returns. If you are starting out with web scraping and still don't understand entirely how it works, I recommend you break the code into bits and slowly figure out what each variable is doing. I always write this disclaimer in web scraping articles: there is a chance that if you read this article months from now, the code here may no longer work correctly. That may happen if Amazon changes the HTML structure of the page, for example. If that happens, I would encourage you to fix it!
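As a minimal, hedged sketch of those ingredients put together: the header and the product-title ID reuse values shown earlier on this page, the URL is a placeholder, and the price selector is only one of several Amazon uses, so it may need adjusting.

import requests
from bs4 import BeautifulSoup

HEADERS = {
    # Any up-to-date browser user agent works; this one mirrors the earlier tutorial.
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"),
    "Accept-Language": "en-US, en;q=0.5",
}

url = "https://www.amazon.com/dp/B0EXAMPLE"  # placeholder product URL

page = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(page.content, "lxml")

# find() looks up an element by tag and attributes; select() takes a CSS selector.
title = soup.find("span", attrs={"id": "productTitle"})
price = soup.select("#priceblock_ourprice")  # selector may differ by store or page layout

print(title.get_text(strip=True) if title else "title not found")
print(price[0].get_text(strip=True) if price else "price not found")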
Once you break up the code and use my examples, it's really no big deal; you just need to fix the selectors/tags your code is fetching. You can always leave a comment below and I'll try to help you out.

(De)constructing the soup

Getting the title of our product should be a piece of cake. The price part is a bit more challenging, and in the end I had to add a few lines of code to get it right, but in this example this part is simplified. You will see later, in the final version of the script, that I also added the functionality to get prices in USD, for my readers who shop from the US store! This led to an increase in try/except controls, to make sure I was getting the right field every time. When writing and testing a web scraper, we always face the same choice sooner or later. We can waste two hours trying to get the exact piece of HTML that will grab the right part of the page every time, without a guarantee that it's even possible! Or we can simply improvise some error-handling conditions that will get us to a working tool faster. I don't always do one or the other, but I did learn to comment a lot more when I'm writing code. That really helps to increase the quality of your code, especially when you want to go back and resume work on a previous project.

Here is the script in action. As you can see, getting the individual ingredients is quite easy. After the testing is done, it is time to write a proper script that will:

get the URLs from a csv file
use a while loop to scrape each product and store the information
save all the results, including previous searches, in an Excel file

To write this you will need your favorite code editor (I use Spyder, which comes with the Anaconda installation; sidenote: version 4 is quite good) and create a new file. There are a few additions to the bits and pieces we have seen above, but I hope the comments help to make it clear. This file is also in the repository.

Tracker Products

A few notes about the tracker file. It's a very simple file, with three columns ("url", "code", "buy_below"). This is where you will add the product URLs that you want to track. You can even place this file in some part of your synced Dropbox folder (update the script with the new file paths afterward), so you can update it anytime from your mobile. If you have the script set up to run on a server or on your own laptop at home, it will pick up the new product link from the file in its next run.

Search History

The same goes for the SEARCH_HISTORY files. On the first run you need to add an empty file (find it in the repository) to the folder "search_history". In line 116 of the script above, when defining the last_search variable, we are trying to find the last file inside the search history folder. That is why you also need to map your own folder here: just replace the text with the folder where you are running this project ("Amazon Scraper" in my example).

Price Alert

Still in the script above, at line 97 you have a section that you can use to send an email with some kind of alert if the price is lower than your limit. This happens every time a price hits your target. The reason it is inside a try command is that we do not always get an actual price from the product page, so the logical comparison would return an error if the product was unavailable, for instance. I did not want to overcomplicate the script too much, so I left the email part out of the final code. However, you can take the code from a very similar project in this article. I left a print instruction with a buying alert in the script, so you just need to replace it with the email part. Consider it your homework!
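If you want a head start on that homework, here is a minimal sketch of what the alert could look like. It assumes you already have the scraped price and the buy_below limit as numbers; the SMTP host, addresses, and password are placeholders, not values from the original script.

import smtplib
from email.message import EmailMessage

def price_alert(product, price, buy_below, sender, password, recipient):
    # Only alert when the scraped price is at or below our limit.
    if price > buy_below:
        return
    msg = EmailMessage()
    msg["Subject"] = f"Price alert: {product} is at {price}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"{product} dropped to {price} (your limit was {buy_below}).")
    # Placeholder SMTP settings; use your provider's host/port and an app password.
    with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
        server.login(sender, password)
        server.send_message(msg)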
Dear Mac/Linux readers, this part will be solely about the scheduler in Windows, as it is the system I use. Sorry, Mac/Linux users! I am 100% positive there are alternatives for those systems too; if you let me know in the comments, I'll add them here.

Setting up an automated task to execute our little script is far from difficult:

1 — You start by opening "Task Scheduler" (simply press the Windows key and type it). You then choose "Create Task" and pick the "Triggers" tab. I have mine set to run every day at 10h00 and 19h30.
2 — Next you move to the Actions tab. Here you will add an action and pick your Python folder location for the "Program/script" box. Mine is located in the Program Files directory, as you can see in the image.
3 — In the arguments box, you type the name of our file with the function.
4 — And we tell the system to start this command in the folder where our file is.

From here the task is ready to run. You can explore more options and make a test run to make sure it works. This is the basic way to schedule your scripts to run automatically with Windows Task Scheduler!

I think we covered a lot of interesting features you can now use to explore other websites or to build something more complex. Thank you for reading, and if you have any questions or suggestions, I try to reply to all messages! I might consider doing a video tutorial if I see some requests below, so let me know if you would like that to go along with the article. If you want to see my other web scraping examples, here are two other projects you can check out. If you read this far, you probably realized already that I like photography, so as a token of my appreciation, I'll leave you with one of my photos! Thank you for reading! As always, I welcome feedback and constructive criticism. If you'd like to get in touch, you can contact me here or simply reply to the article below.
How To Scrape Amazon Product Data using Python
Web scraping helps in automating data extraction from websites. In this tutorial, we will build an Amazon scraper for extracting product details and pricing. We will build this simple web scraper using Python and SelectorLib and run it in a console.
Here is how you can scrape Amazon product details from Amazon product page
Setting up your computer for Amazon Scraping
Packages to install for Amazon scraping
Scrape product details from the Amazon Product Page
Markup the data fields using Selectorlib
The Code
Running the Amazon Product Page Scraper
Scrape Amazon products from the Search Results Page
Markup the data fields using Selectorlib
The Code
Running the Amazon Scraper to Scrape Search Results
What to do if you get blocked while scraping Amazon
Use proxies and rotate them
Specify the User Agents of latest browsers and rotate them
Reduce the number of ASINs scraped per minute
Retry, Retry, Retry
How to Solve Amazon Scraping Challenges
Use a Web Scraping Framework like PySpider or Scrapy
If you need speed, Distribute and Scale-Up using a Cloud Provider
Use a scheduler if you need to run the scraper periodically
Use a database to store the Scraped Data from Amazon
Use Request Headers, Proxies, and IP Rotation to prevent getting Captchas from Amazon
Write some simple data quality tests
How to use Amazon Product Data
Here is how you can scrape Amazon product details from an Amazon product page:
Markup the data fields to be scraped using Selectorlib
Copy and run the code provided
Check out our web scraping tutorials to learn how to scrape Amazon Reviews easily using Google Chrome and how to build an Amazon Review Scraper using Python.
Below, we have also covered how you can scrape product details from the Amazon search results page, how to avoid getting blocked by Amazon, and how to scrape Amazon on a large scale.
Setting up your computer for Amazon Scraping
We will use Python 3 for this Amazon scraper. The code will not run if you are using Python 2.7. To start, you need a computer with Python 3 and PIP installed on it.
Follow this guide to set up your computer and install packages if you are on Windows:
How To Install Python Packages for Web Scraping in Windows 10
Packages to install for Amazon scraping
Python Requests, to make requests and download the HTML content of the Amazon product pages
SelectorLib, a Python package to extract data from the downloaded webpages using the YAML template we create
Using pip3,
pip3 install requests selectorlib
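A quick sanity check: assuming the install succeeded, the imports below should run without errors.

import requests
import selectorlib  # installed by the pip3 command above

print("requests", requests.__version__)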
Scrape product details from the Amazon Product Page
The Amazon product page scraper will scrape the following details from a product page:
Product Name
Price
Short Description
Full Product Description
Image URLs
Rating
Number of Reviews
Variant ASINs
Sales Rank
Link to all Reviews Page
Markup the data fields using Selectorlib
We have already marked up the data, so you can just skip this step if you want to get right to the data.
Here is what our template looks like. See the file here:
Let's save this as a YAML file in the same directory as our code. The code below assumes it is named selectors.yml; adjust if you use a different name.
name:
    css: '#productTitle'
    type: Text
price:
    css: '#price_inside_buybox'
short_description:
    css: '#featurebullets_feature_div'
images:
    css: '.imgTagWrapper img'
    type: Attribute
    attribute: data-a-dynamic-image
rating:
    css:
number_of_reviews:
    css: 'a.a-link-normal h2'
variants:
    css: 'form.a-section li'
    multiple: true
    children:
        name:
            css: ""
            attribute: title
        asin:
            attribute: data-defaultasin
product_description:
    css: '#productDescription'
sales_rank:
    css: 'li#SalesRank'
link_to_all_reviews:
    css: 'a.a-link-emphasis'
    type: Link
Here is a preview of the markup
Selectorlib is a combination of tools for developers that makes marking up and extracting data from web pages easy. The Selectorlib Chrome Extension lets you mark the data that you need to extract, creates the CSS selectors or XPaths needed to extract that data, and then previews how the extracted data will look.
You can learn more about Selectorlib and how to use it to markup data here
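Before the full scraper below, here is a minimal sketch of how a Selectorlib template is used from Python. The file name and the tiny HTML string are placeholders for illustration.

from selectorlib import Extractor

# Load the YAML template saved earlier (adjust the file name to match yours).
extractor = Extractor.from_yaml_file("selectors.yml")

# Any product-page HTML you have downloaded; a tiny stand-in is used here.
html = "<span id='productTitle'>Example Product</span>"

# Returns a dict keyed by the field names defined in the template.
data = extractor.extract(html)
print(data["name"])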
The Code
Create a folder called amazon-scraper and paste your Selectorlib YAML template into it as selectors.yml (the name assumed by the code below).
Let's create a file (we will call it amazon.py here) and paste the code below into it. All it does is:
Read a list of Amazon product URLs from a text file (urls.txt in the code below)
Scrape the data
Save the data as a JSON Lines file
from selectorlib import Extractor
import requests
import json
from time import sleep
# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')  # the template saved earlier; adjust the name if yours differs

def scrape(url):
    headers = {
        'authority': '',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, image/apng, */*;q=0.8, application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB, en-US;q=0.9, en;q=0.8',
    }
    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the scraped fields
    return e.extract(r.text)

# product_data = []
# The file names below are assumptions: a text file with one product URL
# per line, and a JSON Lines file for the results.
with open("urls.txt", 'r') as urllist, open('output.jsonl', 'w') as outfile:
    for url in urllist.readlines():
        data = scrape(url)
        if data:
            json.dump(data, outfile)
            outfile.write("\n")
            # sleep(5)
Running the Amazon Product Page Scraper
You can get the full code from Github. You can start your scraper by typing the command:
python3 amazon.py
Once the scrape is complete, you should see a file (output.jsonl in our example) with your data. Here is an example of the output for one product URL:
{
    "name": "2020 HP 15.6\" Laptop Computer, 16GB DDR4 RAM, 512GB PCIe SSD, 802.11ac WiFi, Bluetooth 4.2, Silver, Windows 10
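Each line of the output is a separate JSON object, so it can be read back for analysis. A brief sketch, assuming the output file name used in the code above:

import json

products = []
with open("output.jsonl") as f:
    for line in f:
        if line.strip():  # skip blank lines
            products.append(json.loads(line))

for p in products:
    print(p.get("name"), "|", p.get("price"))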