Amazon AWS Web Scraping
Serverless Architecture for a Web Scraping Solution – Amazon …
If you are interested in serverless architecture, you may have read many contradictory articles and wondered whether serverless architectures are cost effective or expensive. I would like to clear the air around the question of cost effectiveness through an analysis of a web scraping solution. The use case is fairly simple: at certain times during the day, I want to run a Python script and scrape a website. The execution of the script takes less than 15 minutes. This is an important consideration, which we will come back to later. The project can be considered a standard extract, transform, load (ETL) process without a user interface and can be packed into a self-contained function or library.
Subsequently, we need an environment to execute the script. We have at least two options to consider: on-premises (such as your local machine, a Raspberry Pi server at home, a virtual machine in a data center, and so on) or the cloud. At first glance, the former option may feel more appealing: you have the infrastructure available free of charge, so why not use it? The main concern with an on-premises solution is reliability: can you assure its availability in case of a power outage or a hardware or network failure? Additionally, does your local infrastructure support continuous integration and continuous deployment (CI/CD) tools to eliminate manual intervention? With these two constraints in mind, I will continue the analysis with solutions in the cloud rather than on-premises.
Let’s start with the pricing of three cloud-based scenarios and go into details below.
*The AWS Lambda free usage tier includes 1M free requests per month and 400,000 GB-seconds of compute time per month. Review AWS Lambda pricing.
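To make that free tier concrete, here is a back-of-the-envelope check with assumed figures (a 128 MB function running for 10 minutes, 4 times a day); the helper name is mine, not anything from AWS:

```python
def gb_seconds_per_month(memory_mb, seconds_per_run, runs_per_day, days=31):
    # GB-seconds = allocated memory in GB multiplied by total execution seconds.
    return (memory_mb / 1024) * seconds_per_run * runs_per_day * days

# 128 MB x 600 s x 4 runs/day x 31 days = 9,300 GB-seconds,
# well inside the 400,000 GB-second monthly free tier.
usage = gb_seconds_per_month(128, 600, 4)
print(usage)  # → 9300.0
```

Under these assumptions, a scraping job like ours stays comfortably inside the free tier.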
Option #1
The first option, an instance of a virtual machine in AWS (called Amazon Elastic Compute Cloud, or EC2), is the most primitive one. However, it definitely does not resemble a serverless architecture, so let's consider it a reference point or baseline. This option is similar to an on-premises solution in that it gives you full control of the instance, but you would need to manually spin up an instance, install your environment, set up a scheduler to execute your script at a specific time, and keep it running 24×7. And don't forget security (setting up a VPC, route tables, etc.). Additionally, you will need to monitor the health of the instance and perhaps run manual updates.
Option #2
The second option is to containerize the solution and deploy it on Amazon Elastic Container Service (ECS). The biggest advantage of this is platform independence. Having a Dockerfile (a text document that contains all the commands you could call on the command line to assemble an image) with a copy of your environment and the script enables you to reuse the solution locally, on the AWS platform, or elsewhere. A huge advantage of running it on AWS is that you can integrate with other services, such as AWS CodeCommit, AWS CodeBuild, AWS Batch, etc. You can also benefit from discounted compute resources such as Amazon EC2 Spot Instances.
The architecture, seen in the diagram above, consists of Amazon CloudWatch, AWS Batch, and Amazon Elastic Container Registry (ECR). CloudWatch allows you to create a trigger (such as starting a job when a code update is committed to a code repository) or a scheduled event (such as executing a script every hour). We want the latter: executing a job based on a schedule. When triggered, AWS Batch will fetch a pre-built Docker image from Amazon ECR and execute it in a predefined environment. AWS Batch is a free-of-charge service and allows you to configure the environment and resources needed for task execution. It relies on ECS, which manages resources at execution time. You pay only for the compute resources consumed during the execution of a task.
You may wonder where the pre-built Docker image came from. It was pulled from Amazon ECR, and now you have two options to store your Docker image there:
You can build a Docker image locally and upload it to Amazon ECR.
You just commit a few configuration files (such as a Dockerfile, etc.) to AWS CodeCommit (a code repository) and build the Docker image on AWS. The second option, shown in the image below, allows you to build a full CI/CD pipeline. After updating a script file locally and committing the changes to a code repository on AWS CodeCommit, a CloudWatch event is triggered and AWS CodeBuild builds a new Docker image and commits it to Amazon ECR. When a scheduler starts a new task, it fetches the new image with your updated script file. If you feel like exploring further, or you want to actually implement this approach, please take a look at the example project on GitHub.
Option #3
The third option is based on AWS Lambda, which allows you to build a very lean infrastructure on demand, scales continuously, and has a generous monthly free tier. The major constraint of Lambda is that execution time is capped at 15 minutes. If you have a task that runs longer than 15 minutes, you need to split it into subtasks and run them in parallel, or you can fall back to Option #2.
By default, Lambda gives you access to standard libraries (such as the Python Standard Library). In addition, you can build your own package to support the execution of your function or use Lambda Layers to gain access to external libraries or even external Linux based programs.
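As a sketch of how far the standard library alone can take such a function, here is a minimal Lambda-style handler; the event shape and the `extract_title` helper are my own illustration, not an AWS-defined interface:

```python
import json
import re
import urllib.request  # part of the Python Standard Library available in Lambda

def extract_title(html):
    # Naive scrape: pull the contents of the first <title> tag.
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return match.group(1).strip() if match else None

def handler(event, context):
    # Assumed event shape: {"url": "https://..."}.
    url = event["url"]
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return {"statusCode": 200, "body": json.dumps({"title": extract_title(html)})}
```

Anything beyond the standard library (such as requests or BeautifulSoup) would be packaged into the deployment archive or a Lambda Layer, as described above.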
You can access AWS Lambda via the web console to create a new function, update your Lambda code, or execute it. However, if you go beyond the "Hello World" functionality, you may realize that online development is not sustainable. For example, if you want to access external libraries from your function, you need to archive them locally, upload the archive to Amazon Simple Storage Service (Amazon S3), and link it to your Lambda function.
One way to automate Lambda function development is to use the AWS Cloud Development Kit (AWS CDK), an open source software development framework to model and provision your cloud application resources using familiar programming languages. Initially, the setup and learning might feel strenuous; however, the benefits are worth it. To give you an example, please take a look at this Python class on GitHub, which creates a Lambda function, a CloudWatch event, IAM policies, and Lambda layers.
In summary, the AWS CDK allows you to have infrastructure as code, with all changes stored in a code repository. For a deployment, the AWS CDK builds an AWS CloudFormation template, which is a standard way to model infrastructure on AWS. Additionally, the AWS Serverless Application Model (SAM) allows you to test and debug your serverless code locally, meaning that you can indeed create a continuous integration pipeline.
See an example of a Lambda-based web scraper on GitHub.
Conclusion
In this blog post, we reviewed two serverless architectures for a web scraper on the AWS cloud. Additionally, we explored ways to implement a CI/CD pipeline to avoid future manual intervention.
Amazon Kendra releases Web Crawler to enable web site search
Posted On: Jul 7, 2021
Amazon Kendra is an intelligent search service powered by machine learning, enabling organizations to provide relevant information to customers and employees, when they need it. Starting today, AWS customers can use the Amazon Kendra web crawler to index and search webpages.
Critical information can be scattered across multiple data sources in an enterprise, including internal and external websites. Amazon Kendra customers can now use the Kendra web crawler to index documents made available on websites (HTML, PDF, MS Word, MS PowerPoint, and Plain Text) and search for information across this content using Kendra Intelligent Search. Organizations can provide relevant search results to users seeking answers to their questions, for example, product specification detail that resides on a support website or company travel policy information that’s listed on an intranet webpage.
The Amazon Kendra web crawler is available in all AWS regions where Amazon Kendra is available. To learn more about the feature, visit the documentation page. To explore Amazon Kendra, visit the Amazon Kendra website.
Note: The Kendra web crawler honors access rules in robots.txt, and customers using the Kendra web crawler will need to ensure they are authorized to index those webpages in order to return search results for end users.
How To Scrape Amazon Product Data – Data Science Central
Amazon, as the largest e-commerce corporation in the United States, offers the widest range of products in the world. Their product data can be useful in a variety of ways, and you can easily extract this data with web scraping. This guide will help you develop your approach for extracting product and pricing information from Amazon, and you’ll better understand how to use web scraping tools and tricks to efficiently gather the data you need.
The Benefits of Scraping Amazon
Web scraping Amazon data helps you concentrate on competitor price research, real-time cost monitoring and seasonal shifts in order to provide consumers with better product offers. Web scraping allows you to extract relevant data from the Amazon website and save it in a spreadsheet or JSON format. You can even automate the process to update the data on a regular weekly or monthly basis.
There is currently no way to simply export product data from Amazon to a spreadsheet. Whether it's for competitor research, comparison shopping, creating an API for your app project, or any other business need, we've got you covered. This problem is easily solved with web scraping.
Here are some other specific benefits of using a web scraper for Amazon:
Utilize details from product search results to improve your Amazon SEO status or Amazon marketing campaigns
Compare and contrast your offering with that of your competitors
Use review data for review management and product optimization for retailers or manufacturers
Discover the products that are trending and look up the top-selling product lists for a group
Scraping Amazon is an intriguing business today, with a large number of companies offering product, price, analysis, and other types of monitoring solutions specifically for Amazon. Attempting to scrape Amazon data at a wide scale, however, is a difficult process that often gets blocked by their anti-scraping technology. It's no easy task to scrape such a giant site when you're a beginner, so this step-by-step guide should help you scrape Amazon data, especially when you're using Python Scrapy and Scraper API.
First, Decide On Your Web Scraping Approach
One method for scraping data from Amazon is to crawl each keyword’s category or shelf list, then request the product page for each one before moving on to the next. This is best for smaller scale, less-repetitive scraping. Another option is to create a database of products you want to track by having a list of products or ASINs (unique product identifiers), then have your Amazon web scraper scrape each of these individual pages every day/week/etc. This is the most common method among scrapers who track products for themselves or as a service.
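A sketch of the second approach, with placeholder ASINs (not real product identifiers): keep a fixed list of products to track and regenerate their product-page URLs for each scheduled run.

```python
# Placeholder ASINs for illustration only.
TRACKED_ASINS = ["B000000001", "B000000002"]

def product_urls(asins):
    # Amazon product pages follow the /dp/<ASIN> pattern.
    return [f"https://www.amazon.com/dp/{asin}" for asin in asins]

urls = product_urls(TRACKED_ASINS)
```

Your scheduler (cron, CloudWatch Events, etc.) then feeds these URLs to the scraper on whatever cadence you chose.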
Scrape Data From Amazon Using Scraper API with Python Scrapy
Scraper API allows you to scrape the most challenging websites like Amazon at scale for a fraction of the cost of using residential proxies. We designed anti-bot bypasses right into the API, and you can access additional features like IP geotargeting (&country_code=us) for over 50 countries, JavaScript rendering (&render=true), JSON parsing (&autoparse=true), and more by simply adding extra parameters to your API requests. Send your requests to our single API endpoint or proxy port, and we'll provide a successful HTML response.
Start Scraping with Scrapy
Scrapy is a web crawling and data extraction framework that can be used for a variety of applications such as data mining, information retrieval, and historical archiving. Since Scrapy is written in the Python programming language, you'll need to install Python before you can use pip (Python's package manager).
To install Scrapy using pip, run:
pip install scrapy
Then go to the folder where you want your project saved and run the "startproject" command along with the project name, "amazon_scraper". Scrapy will construct a web scraping project folder for you, with everything already set up:
scrapy startproject amazon_scraper
The result should look like this:
├── scrapy.cfg            # deploy configuration file
└── amazon_scraper        # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py          # project items definition file
    ├── middlewares.py    # project middlewares file
    ├── pipelines.py      # project pipeline file
    ├── settings.py       # project settings file
    └── spiders           # a directory where spiders are located
        ├── __init__.py
        └── amazon.py     # spider we just created
Scrapy creates all of the files you’ll need, and each file serves a particular purpose:
items.py – Can be used to build your base dictionary, which you can then import into the spider.
settings.py – All of your request settings, pipeline, and middleware activation happens in settings.py. You can adjust the delays, concurrency, and several other parameters here.
pipelines.py – The item yielded by the spider is transferred to pipelines.py, which is mainly used to clean the text and bind to databases (Excel, SQL, etc.).
middlewares.py – When you want to change how the request is made and how Scrapy handles the response, middlewares.py comes in handy.
Create an Amazon Spider
You've established the project's overall structure, so now you're ready to start working on the spiders that will do the scraping. Scrapy has a variety of spider types, but we'll focus on the most popular one, the generic spider, in this tutorial.
Simply run the “genspider” command to make a new spider:
# syntax is -> scrapy genspider name_of_spider website.com
scrapy genspider amazon amazon.com
Scrapy now creates a new file from a spider template, and you'll gain a new file called "amazon.py" in the spiders folder. Your code should look like the following:
import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        pass
Delete the default code (allowed_domains, start_urls, and the parse function) and replace it with your own, which should include these four functions:
start_requests — sends an Amazon search query with a specific keyword.
parse_keyword_response — extracts the ASIN value for each product returned in an Amazon keyword query, then sends a new request to Amazon for the product listing. It will also go to the next page and do the same thing.
parse_product_page — extracts all of the desired data from the product page.
get_url — sends the request to the Scraper API, which will return an HTML response.
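A minimal version of the get_url helper might look like the sketch below; the query parameter names follow Scraper API's documented conventions, and the API key is a placeholder you would replace with your own:

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder; use your own Scraper API key

def get_url(url, country_code=None, render=False):
    # Wrap the target URL so the request is routed through Scraper API.
    payload = {"api_key": API_KEY, "url": url}
    if country_code:
        payload["country_code"] = country_code
    if render:
        payload["render"] = "true"
    return "http://api.scraperapi.com/?" + urlencode(payload)
```

Every request the spider yields can then be wrapped with get_url(...) so Scraper API handles proxies and anti-bot measures.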
Send a Search Query to Amazon
You can now scrape Amazon for a particular keyword using the following steps, with an Amazon spider and Scraper API as the proxy solution. This will allow you to scrape all of the key details from the product page and extract each product’s ASIN. All pages returned by the keyword query will be parsed by the spider. Try using these fields for the spider to scrape from the Amazon product page:
ASIN
Product name
Price
Product description
Image URL
Available sizes and colors
Customer ratings
Number of reviews
Seller ranking
The first step is to create start_requests, a function that sends Amazon search requests containing our keywords. Outside of AmazonSpider, you can simply define a list variable holding our search keywords. Input the keywords you want to search for on Amazon into your script:
queries = [‘tshirt for men’, ‘tshirt for women’]
Inside the AmazonSpider, you can build your start_requests function, which will submit the requests to Amazon. Submit a search query "k=SEARCH_KEYWORD" to access Amazon's search functionality via a URL:
https://www.amazon.com/s?k=SEARCH_KEYWORD
It looks like this when we use it in the start_requests function:
## queries = ['tshirt for men', 'tshirt for women']

class AmazonSpider(scrapy.Spider):

    def start_requests(self):
        for query in queries:
            url = 'https://www.amazon.com/s?' + urlencode({'k': query})
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)
You will urlencode each query in your queries list so that it is safe to use as a query string in a URL, and then use scrapy.Request to request that URL.
Use yield instead of return since Scrapy is asynchronous: the functions can either yield a request or a completed dictionary. If a new request is yielded, the callback method is invoked when the response arrives. If an item is yielded, it will be sent to the data cleaning pipeline. The parse_keyword_response callback function will then extract the ASIN for each product when the request completes.
How to Scrape Amazon Products
One of the most popular methods to scrape Amazon involves extracting data from a product listing page. Using an Amazon product page ASIN ID is the simplest and most common way to retrieve this data. Every product on Amazon has an ASIN, which is a unique identifier. We can use this ID in our URLs to get the product page for any Amazon product, such as the following:
https://www.amazon.com/dp/ASIN
Using Scrapy's built-in XPath selector methods, we can extract the ASIN value from the product listing page. You can build an XPath selector in Scrapy Shell that captures the ASIN value for each product on the product listing page and generates a URL for each product:
products = response.xpath('//*[@data-asin]')
for product in products:
    asin = product.xpath('@data-asin').extract_first()
    product_url = f"https://www.amazon.com/dp/{asin}"
The function will then be configured to send a request to this URL and call the parse_product_page callback function when it receives a response. This request will also include the meta parameter, which is used to pass items between functions or edit certain settings.
def parse_keyword_response(self, response):
    products = response.xpath('//*[@data-asin]')
    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.com/dp/{asin}"
        yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})
Extract Product Data From the Amazon Product Page
After the parse_keyword_response function requests the product page's URL, it transfers the response it receives from Amazon, along with the ASIN ID in the meta parameter, to the parse_product_page callback function. We now want to extract the information we need from a product page, such as a product page for a t-shirt.
You need to create XPath selectors to extract each field from the HTML response we get from Amazon:
def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"', response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
Try using a regex selector instead of an XPath selector for scraping the image URL if the XPath is extracting the image in base64.
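A hedged sketch of that regex fallback: it scans the raw response text for the page's embedded image JSON rather than reading the img tag's (possibly base64-encoded) src attribute. The helper name and sample string are illustrative.

```python
import re

def extract_image_url(html):
    # Amazon product pages embed image variants as JSON, e.g. "large":"https://...".
    match = re.search(r'"large":"(.*?)"', html)
    return match.group(1) if match else None

sample = '{"large":"https://m.media-amazon.com/images/I/example.jpg"}'
```

If no "large" key is present in the page source, the helper returns None and you can fall back to the XPath selector.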
When working with large websites like Amazon that have a variety of product pages, you’ll find that writing a single XPath selector isn’t always enough since it will work on certain pages but not others. To deal with the different page layouts, you’ll need to write several XPath selectors in situations like these.
When you run into this issue, give the spider three different XPath options:
def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract()
    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    if not price:
        price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()
If the spider is unable to locate a price using the first XPath selector, it goes on to the next. If we look at the product page again, we can see that there are different sizes and colors of the product.
To get this info, we’ll write a fast test to see if this section is on the page, and if it is, we’ll use regex selectors to extract it.
temp = response.xpath('//*[@id="twister"]')
sizes = []
colors = []
if temp:
    s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
    json_acceptable = s.replace("'", "\"")
    di = json.loads(json_acceptable)
    sizes = di.get('size_name', [])