Sitemap Scraper
Sitemap.xml selector | Web Scraper Documentation
The Sitemap.xml link selector can be used similarly to the Link selector to get to target pages (for example, product pages).
By using this selector, the whole site can be traversed without setting up selectors for pagination or other site navigation.
The Sitemap.xml link selector extracts URLs from sitemap.xml files, which websites publish so that search engine crawlers can navigate them more easily.
In most cases, they contain all of the site's relevant page URLs.
Web Scraper supports the standard sitemap.xml format.
The sitemap.xml file can also be gzip-compressed (sitemap.xml.gz).
If a sitemap.xml file contains URLs of other sitemap files, the selector will work recursively to find all URLs in the sub-sitemaps.
Note! Web Scraper has a sitemap.xml download size limit.
If multiple sitemap.xml URLs are used, the scraping job might fail due to exceeding the limit.
To work around this, try splitting the sitemap into multiple sitemaps, where each sitemap has only one sitemap.xml URL.
Note! Sites that have sitemap.xml files are sometimes quite large.
We recommend using Web Scraper Cloud for large-volume scraping.
Configuration options
sitemap.xml urls – a list of URLs of the site's sitemap.xml files. Multiple URLs can be added. By clicking "Add from robots.txt",
Web Scraper will automatically add all sitemap.xml URLs that can be found in the site's robots.txt file.
If no URLs are found, it is worth checking the /sitemap.xml URL, which might contain a sitemap.xml file that isn't listed in the robots.txt file.
found URL RegEx (optional) – a regular expression to match a substring of the URLs. If set, only URLs from the sitemap.xml
that match the RegEx will be scraped.
minimum priority (optional) – the minimum priority of URLs to be scraped.
Inspect the sitemap.xml file to decide whether this value should be filled in.
Use cases
Sitemap.xml files are usually used by sites that want to be indexed by search engines, so sitemaps can be found for most:
e-commerce sites;
travel sites;
news sites;
yellow pages.
The best way to scrape a whole site is by using the Sitemap.xml link selector. It removes the need to deal
with pagination, categories and search forms/queries. Note that some sites don't display the category tree (breadcrumbs) if a page is
opened directly; in these cases the site has to be traversed through category pages to scrape the category tree.
Making sure that only specific pages are scraped
As in most cases the sitemap.xml contains all pages of the site, it is possible to limit the scraper so that it scrapes only
the pages that contain the required data. For example, an e-commerce site's sitemap.xml will contain product pages,
category pages and contact/about/etc. pages. To limit the scraper so that it scrapes only product pages, one or more of these methods
can be used:
Using RegEx – if all product URLs contain a specific string that other page types don't contain, this string can
be set in the found URL RegEx field and the scraper will traverse only pages that match it, for example /product/. This prevents
the scraper from traversing and scraping unnecessary pages.
Setting minimum priority – some sites prioritize specific page types over others in their sitemap.xml. If that is the case, setting the minimum priority
will improve scraped page precision.
Using a wrapper Element selector – if none of the previously mentioned methods are possible, a wrapper Element
selector can be set up. This method works for all sites and doesn't leave empty records in the result file when an invalid or
unnecessary page is traversed. To set up the wrapper Element selector, follow these steps:
Open a few pages that need to be scraped.
Find an element that exists only on these types of pages, for example a product title with a class such as .product-title.
Create an Element selector and set it as a child selector of the Sitemap.xml link selector.
Set the Element selector to multiple and set its selector to (for example) body:has(.product-title).
Set the rest of the selectors as child selectors of this Element selector.
The key part of this method is that a unique element has to be found and included in the body:has(unique_selector)
selector. If data from meta tags has to be scraped, the html tag can be used instead of the body tag. The scraper will
extract data only from pages that contain this unique element.
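The same check can be illustrated outside the extension. Below is a minimal Python sketch, assuming BeautifulSoup with the soupsieve backend (which supports the :has() pseudo-class) and a hypothetical .product-title class:

```python
# pip install requests beautifulsoup4 soupsieve
import requests
from bs4 import BeautifulSoup

def is_product_page(url: str) -> bool:
    """Return True only if the page contains the unique product element."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Same idea as the extension's body:has(.product-title) wrapper selector:
    # the page counts as a product page only when the unique element exists.
    return soup.select_one("body:has(.product-title)") is not None

# Hypothetical usage:
# is_product_page("https://example.com/product/123")  -> True
# is_product_page("https://example.com/about-us")     -> False
```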
When using the Sitemap.xml link selector, set the main page of the site as the start URL.
Sitemap Scraper Addon – ScrapeBox
The ScrapeBox Sitemap Scraper addon is included free with ScrapeBox, and it allows you to extract URLs from plain or gzip-compressed sitemaps. Sitemaps generally list all of a site's pages, so gathering every URL belonging to a site via its sitemap is a far easier and faster way to collect this information than harvesting it from search engines with various site: operators.
The Sitemap Scraper addon also has a "Deep Crawl" facility: it will visit every URL listed in the sitemap, then fetch any further new URLs found on those pages that are not contained in the sitemap. Occasionally sites list only their most important pages in the sitemap, so the deep crawl can dig deep and extract thousands of extra URLs.
You can also use keyword filters to control which URLs are and are not crawled. This is ideal on large sites that may contain thousands of unnecessary pages, such as calendars, or files such as documents you wish to avoid. You can also opt to skip URLs that use https, to avoid secure sections of a website listed in the sitemap file.
Once the sitemap URLs are extracted, they can be viewed or exported to a text file for further use in ScrapeBox, such as checking the PageRank of all URLs, creating an HTML sitemap, extracting the page titles, descriptions and keywords, checking Google cache dates, or scanning the list with the ScrapeBox malware checker addon to ensure all your pages are clean. ScrapeBox also has a Sitemap Creator which enables you to create a sitemap from a list of URLs.
Using sitemaps to write efficient web scrapers – a step by step guide
This post is part of the Scraping Series.
Usually, when you start developing a scraper to collect a large number of records, your first step is to go to the page where all the listings are available. You go page by page, fetch the individual URLs, store them in a DB or in a file, and then start parsing. Nothing wrong with that; the only issue is the waste of resources. Say there are 100 records in a certain category and each page shows 10 records. You write a scraper that goes page by page and fetches all the links, then you switch to the next category and repeat the process. Imagine there are 10 categories on the website and each category has 100 records. The calculation would be:
Total records = categories x records per category = 10 x 100 = 1000
Listing-page requests for one category = 100 records / 10 per page = 10 requests
Listing-page requests for 2 categories = 10 x 2 = 20
If a single request takes 500 ms on average, those 20 requests alone take 10 seconds.
So with the given numbers you are always making these extra listing-page requests just to collect links; with n categories, that is 10n extra requests.
The point I am trying to make is that you are wasting both time and bandwidth here, even if you avoid blocking by using proxies. So what can you do? Is there another option? Yes, we can take advantage of sitemaps.
What is a Sitemap
From Google
A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them. Search engines like Google read this file to more intelligently crawl your site. A sitemap tells Google which pages and files you think are important in your site, and also provides valuable information about these files: for example, for pages, when the page was last updated, how often the page is changed, and any alternate language versions of a page.
Basically, this single file holds all the info Google needs to index your website. Google runs its own crawlers too, and your site will be indexed even without a sitemap, but this practice helps Google fetch the info you want to be available, the way you want it. Usually a sitemap URL looks like https://example.com/sitemap.xml, but that is not mandatory: the webmaster can change the default sitemap URL at will. We will see later in this post how to deal with such situations. A simple sitemap file looks like the one below.
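Here is a minimal sitemap in the standard sitemaps.org format (the URL and values are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/product/123</loc>
    <lastmod>2021-06-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```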
The most important field for us is <loc>, which holds the URL of an individual page.
Alright, let's do some work. Earlier I scraped the Airbnb website; now we will deal with it differently. Our first goal is to find the sitemap URL. Unfortunately, it is not at the default location. So how do we find it, brute force? No. All I am going to do is check the robots.txt file. This file is meant for search engine crawlers and contains instructions about what to crawl and what not to crawl. Like a sitemap file, it sits at the root of the domain, and it always has the same well-known name, so you can always refer to it to find the sitemap. If you visit Airbnb's robots.txt file (https://www.airbnb.com/robots.txt), you will find the sitemap entry in it.
The entry uses a simple syntax: you write Sitemap: and after it the root URL of your sitemap file. In Airbnb's case the sitemap is a compressed .gz file, so we will have to download and uncompress it. You may do that manually or automate it; it is not the topic of this post. Once I uncompress it, I find a sitemap index file inside. If you open it, you will find entries like the ones below:
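The entries follow the standard sitemap index format; an illustrative index (not Airbnb's actual file names) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-listings-1.xml.gz</loc>
    <lastmod>2021-06-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-listings-2.xml.gz</loc>
    <lastmod>2021-06-01</lastmod>
  </sitemap>
</sitemapindex>
```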
As you can see, it contains more sitemap files. All we have to do is explore them one by one, or figure out which one we need and download and unzip only that file.
If you open and view one of those files you will find that it lists further sitemap files, and there are many such links. Let's download one of them and see what is inside. It's a pretty huge file, several MB in size. Let's write the code to fetch all the URLs in it.
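A minimal sketch of such a snippet, assuming the chosen sitemap has already been downloaded and decompressed to a local file (listings.xml and urls.txt are placeholder names):

```python
# pip install xmltodict
import xmltodict

SITEMAP_FILE = "listings.xml"  # placeholder: one decompressed sitemap file
OUTPUT_FILE = "urls.txt"       # placeholder: where the page URLs get saved

with open(SITEMAP_FILE, "r", encoding="utf-8") as f:
    sitemap = xmltodict.parse(f.read())

# A standard sitemap is a <urlset> of <url> entries, each with a <loc> tag.
entries = sitemap["urlset"]["url"]
entries = entries if isinstance(entries, list) else [entries]
urls = [entry["loc"] for entry in entries]

with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    f.write("\n".join(urls))

print(f"Saved {len(urls)} URLs to {OUTPUT_FILE}")
```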
This snippet reads the sitemap file, uses xmltodict to convert it into a dict, and then iterates over the URLs and saves them into a file. Once this is done, you will find there were around 50K records, with the individual listing URLs right there. Now imagine fetching all 50K URLs by going page by page: even at 100 records per page, you would still need to make 500 requests. Here, with a single request, you have all of them. Cool, no?
Alright, now I am going to pick one of the links and scrape some info from it. I am going to use Scraper API. Why? Because when I tried to scrape the page normally I did not get the info, as the page is generated dynamically via JavaScript. So the best option is to use a cloud-based scraping API that also provides remote headless rendering.
For the sake of example, I have only scraped the title, as the purpose of this post is to teach you the approach of getting the individual URLs.
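A sketch of that request, assuming a Scraper API key (YOUR_API_KEY is a placeholder) and using the service's render parameter for JavaScript-heavy pages; the listing URL is illustrative:

```python
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"  # placeholder Scraper API key
listing_url = "https://www.airbnb.com/rooms/12345"  # illustrative URL taken from the saved list

payload = {
    "api_key": API_KEY,
    "url": listing_url,
    "render": "true",  # ask the API to render JavaScript before returning the HTML
}
response = requests.get("http://api.scraperapi.com", params=payload, timeout=70)

soup = BeautifulSoup(response.text, "html.parser")
title = soup.title.string if soup.title else None
print(title)
```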
Conclusion
In this post, you learned a trick for getting the links to parse via the sitemap, which makes your entire flow more manageable and efficient. You can automate the whole process: from accessing the main sitemap URL, to downloading and extracting the compressed files, to saving the records into a file.
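An end-to-end sketch of that automation, under the assumption that the site follows the conventions described above (a robots.txt with a Sitemap: line pointing to gzipped sitemaps or a sitemap index); all URLs and file names here are illustrative:

```python
# pip install requests xmltodict
import gzip
import requests
import xmltodict

BASE = "https://example.com"  # illustrative site

def sitemaps_from_robots(base_url: str) -> list[str]:
    """Read robots.txt and return every URL declared on a 'Sitemap:' line."""
    robots = requests.get(f"{base_url}/robots.txt", timeout=30).text
    return [line.split(":", 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith("sitemap:")]

def fetch_xml(url: str) -> dict:
    """Download a sitemap and parse it into a dict.

    Assumes a .gz URL is served as raw gzip bytes rather than via Content-Encoding.
    """
    raw = requests.get(url, timeout=30).content
    if url.endswith(".gz"):
        raw = gzip.decompress(raw)
    return xmltodict.parse(raw)

def page_urls(sitemap_url: str) -> list[str]:
    """Return all page URLs, recursing into sitemap indexes when present."""
    doc = fetch_xml(sitemap_url)
    if "sitemapindex" in doc:
        children = doc["sitemapindex"]["sitemap"]
        children = children if isinstance(children, list) else [children]
        urls = []
        for child in children:
            urls.extend(page_urls(child["loc"]))
        return urls
    entries = doc["urlset"]["url"]
    entries = entries if isinstance(entries, list) else [entries]
    return [entry["loc"] for entry in entries]

if __name__ == "__main__":
    all_urls = []
    for sm in sitemaps_from_robots(BASE):
        all_urls.extend(page_urls(sm))
    with open("urls.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(all_urls))
    print(f"Saved {len(all_urls)} URLs")
```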
Oh, and if you sign up here with my referral link or enter the promo code adnan10, you will get a 10% discount on it. If you do not get the discount, just let me know via email on my site and I will be sure to help you out.
Frequently Asked Questions about sitemap scraper
How do you scrape a sitemap?
How to scrape a website using sitemap.xml and the Apify SDK: locate the sitemap, either at the /sitemap.xml path or via robots.txt; identify the URLs for the pages you want to scrape and create a regular expression to capture them; then import the URLs into the Apify SDK's RequestList.
Is Web scraping free?
Data Scraper (Chrome) can scrape data from tables and listing-type data from a single web page. Its free plan should satisfy most simple scraping with a light amount of data. The paid plan has more features, such as an API and many anonymous IP proxies, and lets you fetch a large volume of data in real time, faster.
What is Webscraper.io?
Webscraper.io is a free extension for the Google Chrome web browser with which users can extract information from any public website using HTML and CSS and export the data as a Comma Separated Value (CSV) file, which can be opened in spreadsheet processing software like Excel or Google Sheets.