Yellow Pages Scraper Python
Yellow Pages Business Details Scraper – GitHub
A web scraper written in Python and LXML to extract business details for a particular category and location.
If you would like to know more about this scraper, check out the blog post ‘How to Scrape Business Details from Yellow Pages using Python and LXML’.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Fields to Extract
This yellow pages scraper can extract the fields below:
Rank
Business Name
Phone Number
Business Page
Category
Website
Rating
Street name
Locality
Region
Zipcode
URL
Prerequisites
For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML.
Below are the package requirements:
lxml
requests
Installation
PIP, to install the packages below (an example install command follows this list)
Python Requests, to make requests and download the HTML content of the pages
Python LXML, for parsing the HTML tree structure using XPaths
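If pip is already set up, the packages can be installed from a terminal in one line. A minimal sketch (unicodecsv is only needed by the full script shown later in this post):
pip3 install requests lxml unicodecsv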
Running the scraper
We execute the code with the script name followed by the positional arguments keyword and place. Here is an example
to find the business details for restaurants in Boston, MA:
python3 restaurants "Boston, MA"
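Note that argparse splits arguments on whitespace, so a place name containing a space should be wrapped in quotes. For example, assuming the script was saved as yellowpages_scraper.py (the filename here is just a placeholder):
python3 yellowpages_scraper.py restaurants "Boston, MA"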
Sample Output
This will create a csv file:
Scraping Yellow pages with Python and BeautifulSoup
Today let's look at scraping Yellow Pages data using BeautifulSoup and the requests module in Python. Below is a simple script that does that. BeautifulSoup will help us extract information, and we will retrieve the crucial pieces of data. To start with, this is the boilerplate code we need to get the search results page and set up BeautifulSoup so we can use CSS selectors to query the page for meaningful data.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = ''  # Yellow Pages search results URL for the category and place you want to scrape
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
We are also passing the User-Agent header to simulate a browser call so we don't get blocked. Now let's analyze the Yellow Pages search results page. When we inspect it, we find that each item's HTML is encapsulated in a tag with the class v-card. We can use this to break the HTML document into these cards, which contain the individual item information, like this.
# -*- coding: utf-8 -*-
soup = BeautifulSoup(response.content, 'lxml')
for item in soup.select('.v-card'):
    try:
        print('----------------------------------------')
        print(item)
    except Exception as e:
        print('')
And when you run it with python3, you can tell that the code is isolating the v-cards. On further inspection, you can see that the name of the place always has the class business-name. So let's try and retrieve that.
# -*- coding: utf-8 -*-
for item in soup.select('.v-card'):
    try:
        print(item.select('.business-name')[0].get_text())
    except Exception as e:
        #raise e
        print('')
That will get us the business names! Now let's get the other data pieces.
# -*- coding: utf-8 -*-
        # the class-based selectors below may need adjusting to the current page markup
        print(item.select('.ratings div')[0]['class'])
        print(item.select('.ratings div span')[0].get_text())
        print(item.select('.phones')[0].get_text())
        #print(item.select('.street-address')[0].get_text())
        print('')
And when run, this produces all the info we need, including ratings, reviews, phone and address. For more advanced implementations you will need to rotate the User-Agent string so Yellow Pages can't tell it's the same browser! If we get a little bit more advanced, you will realize that Yellow Pages can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really gets going. Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly any effort. Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems, with millions of high-speed rotating proxies located all over the world:
With our automatic IP rotation
With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
With our automatic CAPTCHA solving technology
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API. The whole thing can be accessed by a simple API like below in any programming language.
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
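A minimal illustrative sketch in Python, assuming a rotating-proxy HTTP endpoint that takes an API key and a target URL as query parameters (the endpoint and parameter names below are placeholders, not a documented interface; substitute the values from the provider's documentation):
import requests

# placeholder endpoint and parameter names for a rotating-proxy service
PROXY_API_ENDPOINT = 'http://api.example-rotating-proxy.com/'
API_KEY = 'YOUR_API_KEY'
target = 'https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=Boston%2C+MA'

# the proxy service fetches the target URL through a rotating IP and returns the HTML
response = requests.get(PROXY_API_ENDPOINT, params={'auth_key': API_KEY, 'url': target})
print(response.status_code)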
How to Scrape Yellow Pages Using Python and LXML
In this web scraping tutorial, we will show you how to scrape yellow pages and extract business details based on a city and category.
For this scraper, we will search for restaurants in a city. Then scrape business details from the first page of results.
What data are we extracting?
Here is the list of data fields we will be extracting:
Rank
Business Name
Phone Number
Business Page
Category
Website
Rating
Street Name
Locality
Region
Zip code
URL
Below is a screenshot of the data we will be extracting from YellowPages.
Finding the Data
Before we start building the Yellowpages scraper, we first need to find where the data sits in the web page's HTML tags. You'll need a basic understanding of HTML tags in the web page's content to do so.
If you already understand HTML and Python, this will be simple for you. You don't need advanced programming skills for most of this tutorial.
If you don't know much about HTML and Python, spend some time reading Getting started with HTML – Mozilla Developer Network and an introductory Python tutorial.
Let's inspect the HTML of the web page and find out where the data is located. Here is what we're going to do:
Find the HTML tag that encloses the list of links from where we need the data from
Get the links from it and extract the data
Inspecting the HTML
Why should we inspect the element? – To find any element on the web page using an XML path expression (XPath).
Open any browser (we are using Chrome here) and go to the YellowPages search results page. Right-click on any link on the page and choose Inspect Element. The browser will open a toolbar and show the HTML content of the web page in a well-structured format.
The GIF above shows the data we need to extract inside a DIV tag. If you look closely, it has a ‘class’ attribute with the value ‘result’. This DIV contains the data fields we need to extract.
Now let's find the HTML tag(s) which hold the links we need to extract. You can right-click on the link title in the browser and do Inspect Element again. It will open the HTML content like before and highlight the tag which holds the data you right-clicked on. In the GIF below you can see the data fields structured in order.
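Once you have identified the tag that wraps each listing, you can quickly verify it from a Python shell before writing the full scraper. A minimal sketch, assuming the ‘v-card’ structure used in the full script further below (the search URL follows the same format the script builds):
import requests
from lxml import html

url = "https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=Boston%2C+MA"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
parser = html.fromstring(response.text)
# each business listing is a 'v-card' DIV inside the organic search results
cards = parser.xpath("//div[@class='search-results organic']//div[@class='v-card']")
print(len(cards))  # number of listings found on the first page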
How to set up your computer to scrape Yellow Pages
We will use Python 3 for this Yellow Pages scraping tutorial. The code will not run if you are using Python 2.7. To start, your system needs Python 3 and PIP installed.
Most UNIX operating systems like Linux and Mac OS come with Python pre-installed. But not all Linux distributions ship with Python 3 by default.
To check your Python version, open a terminal (in Linux and Mac OS) or Command Prompt (on Windows) and type:
python --version
Then press the Enter key. If the output looks something like Python 3.x.x, you have Python 3 installed. Likewise, if it's Python 2.x, you have Python 2. But if it prints an error, you probably don't have Python installed.
Install Python 3 and Pip
Here is a guide to install Python 3 in Linux. Mac users can follow this guide, and likewise, Windows users can find one here.
Install Packages
Python Requests, to make requests and download the HTML content of the pages.
Python LXML, for parsing the HTML tree structure using XPaths
The Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from lxml import html
import unicodecsv as csv
import argparse
def parse_listing(keyword, place):
    """
    Function to process a Yellowpages listing page
    :param keyword: search query
    :param place: place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    print("retrieving ", url)
    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
    # Adding retries
    for retry in range(10):
        try:
            response = requests.get(url, verify=False, headers=headers)
            print("parsing page")
            if response.status_code == 200:
                parser = html.fromstring(response.text)
                # making links absolute
                base_url = "https://www.yellowpages.com"
                parser.make_links_absolute(base_url)
                XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
                listings = parser.xpath(XPATH_LISTINGS)
                scraped_results = []
                for results in listings:
                    XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
                    XPATH_BUSSINESS_PAGE = ".//a[@class='business-name']//@href"
                    XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
                    XPATH_ADDRESS = ".//div[@class='info']//div//p[@itemprop='address']"
                    XPATH_STREET = ".//div[@class='street-address']//text()"
                    XPATH_LOCALITY = ".//div[@class='locality']//text()"
                    XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
                    XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
                    XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
                    XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class, 'info-section')]//div[@class='categories']//text()"
                    XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class, 'info-section')]//div[@class='links']//a[contains(@class, 'website')]/@href"
                    XPATH_RATING = ".//div[@class='info']//div[contains(@class, 'info-section')]//div[contains(@class, 'result-rating')]//span//text()"
                    raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                    raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                    raw_business_page = results.xpath(XPATH_BUSSINESS_PAGE)
                    raw_categories = results.xpath(XPATH_CATEGORIES)
                    raw_website = results.xpath(XPATH_WEBSITE)
                    raw_rating = results.xpath(XPATH_RATING)
                    # address = results.xpath(XPATH_ADDRESS)
                    raw_street = results.xpath(XPATH_STREET)
                    raw_locality = results.xpath(XPATH_LOCALITY)
                    raw_region = results.xpath(XPATH_REGION)
                    raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                    raw_rank = results.xpath(XPATH_RANK)
                    business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                    telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                    business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                    rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                    category = ', '.join(raw_categories).strip() if raw_categories else None
                    website = ''.join(raw_website).strip() if raw_website else None
                    rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                    street = ''.join(raw_street).strip() if raw_street else None
                    locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None
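                    # the locality text typically looks like "Boston, MA 02128";
                    # split it into city (locality), region and ZIP code below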
                    locality, locality_parts = locality.split(',')
                    _, region, zipcode = locality_parts.split(' ')
                    business_details = {
                        'business_name': business_name,
                        'telephone': telephone,
                        'business_page': business_page,
                        'rank': rank,
                        'category': category,
                        'website': website,
                        'rating': rating,
                        'street': street,
                        'locality': locality,
                        'region': region,
                        'zipcode': zipcode,
                        'listing_url': response.url}
                    scraped_results.append(business_details)
                return scraped_results
            elif response.status_code == 404:
                print("Could not find a location matching", place)
                # no need to retry for non existing page
                break
            else:
                print("Failed to process page")
                return []
        except Exception:
            print("Failed to process page")
            return []
if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')
    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place
    scraped_data = parse_listing(keyword, place)
    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category', 'website', 'rating',
                          'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)
Execute the full code by typing the script name followed by a -h in command prompt or terminal:
usage: [-h] keyword place
positional arguments:
keyword Search Keyword
place Place Name
optional arguments:
-h, --help  show this help message and exit
The positional argument keyword represents a category, and place is the desired location to search for a business. As an example, let's find the business details for restaurants in Boston, MA. The script would be executed as:
python3 restaurants "Boston, MA"
You should see a CSV file named after the keyword and place in the same folder as the script, with the extracted data. Here is some sample data of the business details extracted from YellowPages for the command above.
The data will be saved as a CSV file. Let us know in the comments how this code to scrape Yellowpages worked for you.
Known Limitations
This code should be capable of scraping business details for most locations. But if you want to scrape Yellowpages on a large scale, you should read How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
If you need some professional help with scraping websites contact us by filling up the form below.
We can help with your data or automation needs
Turn the Internet into meaningful, structured and usable data
Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
Frequently Asked Questions about yellow pages scraper python
How do you scrap data from yellow pages?
To do this, click on the green “Get Data” button in the left sidebar. Here, you can test, run or schedule your scrape. In this case, we will run it right away. ParseHub is now off to scrape the data you have selected from Yellow Pages.
Is Python scraping legal?
Web scraping and crawling aren’t illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. … Big companies use web scrapers for their own gain but also don’t want others to use bots against them.
What is Yellow Pages scraping?
Basically, it is a tool that searches popular yellow pages directories and provides you with information such as business names, addresses, phone numbers, as well as emails.