Craigslist Web Crawler

Craigslist Web Crawler

November 16, 2021
0

Web Scraping Craigslist: A Complete Tutorial | by Riley Predum

I’ve been looking to make a move recently. And what better way to know I’m getting a good price than to sample from the “population” of housing on Craigslist? Sounds like a job for…Python and web scraping! In this article, I’m going to walk you through my code that scrapes East Bay Area Craigslist for apartments. The code here, and/or the URI parameters rather, can be modified to pull from any region, category, property type, etc. Pretty cool, huh? I’m going to share GitHub gists of each cell in the original Jupyter Notebook. If you’d like to just see the whole code at once, clone the repo. Otherwise, enjoy the read and follow along! Getting the DataFirst things first I needed to use the get module from the requests package. Then I defined a variable, response, and assigned it to the get method called on the base URL. What I mean by base URL is the URL at the first page you want to pull data from, minus any extra arguments. I went to the apartments section for the East Bay and checked the “Has Picture” filter to narrow down the search just a little though, so it’s not a true base URL. I then imported BeautifulSoup from bs4, which is the module that can actually parse the HTML of the web page retrieved from the server. I then checked the type and length of that item to make sure it matches the number of posts on the page (there are 120). You can find my import statements and setup code below:It prints out the length of posts which is 120, as the find_all method on the newly created html_soup variable in the code above, I found the posts. I needed to examine the website’s structure to find the parent tag of the posts. Looking at the screenshot below, you can see that it’s

. That is the tag for one single post, which is literally the box that contains all the elements I grabbed! In order to scale this, make sure to work in the following way: grab the first post and all the variables you want from it, make sure you know how to access each of them for one post before you loop the whole page, and lastly, make sure you successfully scraped one page before adding the loop that goes through all sultSet is indexed, so I looked at the first apartment by indexing posts[0]. Surprise, it’s all the code that belongs to that

tag! You should have this output for the first post in posts (posts[0]), assigned to price of the post is easy to () removes whitespace before and after a stringI grabbed the date and time by specifying the attribute ‘datetime’ on class ‘result-date’. By specifying the ‘datetime’ attribute, I saved a step in data cleaning by making it unnecessary to convert this attribute from a string to a datetime object. This could also be made into a one-liner by placing [‘datetime’] at the end of the () call, but I split it into two lines for URL and post title are easy because the ‘href’ attribute is the link and is pulled by specifying that argument. The title is just the text of that number of bedrooms and square footage are in the same tag, so I split these two values and grabbed each one element-wise. The neighborhood is the tag of class “result-hood”, so I grabbed the text of next block is the loop for all the pages for the East Bay. Since there isn’t always information on square footage and number of bedrooms, I built in a series of if statements embedded within the for loop to handle all loop starts on the first page, and for each post in that page, it works through the following logic:I included some data cleaning steps in the loop, like pulling the ‘datetime’ attribute and removing the ‘ft2’ from the square footage variable, and making that value an integer. I removed ‘br’ from the number of bedrooms as that was scraped as well. That way, I started data cleaning with some work already done. Elegant code is the best! I wanted to do more, but the code would become too specific to this region and might not work across code below creates the dataframe from the lists of values! Awesome! There it is. Admittedly, there is still a little bit of data cleaning to be done. I’ll go through that real quick, and then it’s time to explore the data! Exploratory Data AnalysisSadly, after removing the duplicate URLs I saw that there are only 120 instances. These numbers will be different if you run the code, since there will be different posts at different times of scraping. There were about 20 posts that didn’t have bedrooms or square footage listed too. For statistical reasons, this isn’t an incredible data set, but I took note of that and pushed scriptive statistics for the quantitative variablesI wanted to see the distribution of the pricing for the East Bay so I made the above plot. Calling the. describe() method, I got a more detailed look. The cheapest place is $850, and the most expensive is $4, next code block generates a scatter plot, where the points are colored by the number of bedrooms. This shows a clear and understandable stratification: we see layers of points clustered around particular prices and square footages, and as price and square footage increase so do the number of ’s not forget the workhorse of Data Science: linear regression. We can call a regplot() on these two variables to get a regression line with a bootstrap confidence interval calculated about the line and shown as a shaded region with the code below. If you haven’t heard of bootstrap confidence intervals, they are a really cool statistical technique that are worth a looks like we have an okay fit of the line on these two variables. Let’s check the correlations. I called () to get these:Correlation matrix for our variablesAs suspected, correlation is strong between number of bedrooms and square footage. That makes sense since square footage increases as the number of bedrooms icing By Neighborhood ContinuedI wanted to get a sense of how location affects price, so I grouped by neighborhood, and aggregated by calculating the mean for each following is produced with this single line of code: oupby(‘neighborhood’)() where ‘neighborhood’ is the ‘by=’ argument, and the aggregator function is the mean. I noticed that there are two North Oaklands: North Oakland and Oakland North, so I recoded one of them into the other like so:eb_apts[‘neighborhood’]. replace(‘North Oakland’, ‘Oakland North’, inplace=True). Grabbing the price and sorting it in ascending order can show the cheapest and most expensive places to live. The full line of code is now: oupby(‘neighborhood’)()[‘price’]. sort_values() and results in the following output:Average price by neighborhood sorted in ascending orderLastly, I looked at the spread of each neighborhood in terms of price. By doing this, I saw how prices in neighborhoods can vary, and to what ’s the code that produces the plot that rkeley had a huge spread. This is probably because it includes South Berkeley, West Berkeley, and Downtown Berkeley. In a future version of this project it may be important to consider changing the scope of each of the variables so they are more reflective of the variability of price between neighborhoods in each, there you have it! Take a look at this the next time you’re in the market for housing to see what a good price should be. Feel free to check out the repo and try it for yourself, or fork the project and do it for your city! Let me know what you come up with! Scrape you learned something new and would like to pay it forward to the next learner, consider donating any amount you’re comfortable with, thanks! Happy coding! Riley

How to Build A Web Crawler and Scraper for Craigslist Using …

When I built my Hackbright project, the biggest setback for me was obtaining data for it. When an API doesn’t provide what you need, what can you do? The answer: build your own web crawler and scraper. This is where Scrapy, a framework written in Python, comes into play.
(BeautifulSoup is another commonly used web scraper, but it isn’t as robust as Scrapy. I actually did a lightning tech talk on web scraping using BeautifulSoup and Scrapy, and you can check out the slides here, checkout my github code here, or keep reading for the verbose tutorial version. )
WHAT
is a web crawler and scraper? In short, a crawler (aka spider) “crawls” or surfs the web for you, and a scraper extracts data from a particular web page. Put the two together and you can get data from multiple pages automatically and very, very quickly. It’s some powerful shit.
WHY
choose BeautifulSoup or Scrapy? The major advantage here is Python. If this is your language of choice, chances are you’ll want to use BeautifulSoup or Scrapy. There are amazing tutorials out there for BeautifulSoup. In fact it’s relatively simple to use so for the remainder of this post I will only be diving into how to set up Scrapy. This is intended to be a beginner’s guide and we’ll just be scraping (haha) the surface of what Scrapy can be used for.
Helpful things to know before you get started:
Python (and an understanding of object oriented programming and callbacks)
How to use CSS selectors or preferably XPATH selectors and what they are. Scrapy also talks about these a little. Scrapy is fan-freaking-tastic and also provides a command line tool that allows you to check the accuracy of your CSS or XPATH selectors. This is extremely helpful for troubleshooting after you’ve installed Scrapy and created your project in the Initial Setup described below. Once you’re ready to configure your spider and you’re writing your CSS/XPATH selectors, here are some examples to get you started:
First, in your terminal type: $ scrapy shell insert-your-url – this sends a GET request for the URL
Now that you are in the Scrapy Shell, try: $ – this gives you the status code of the response
Or try: $ (‘//title’). extract() – XPATH selector way of saying ‘give me the title of that page! ’
Or: $ (‘title’). extract() – CSS selector way of saying ‘give me the title of that page! ’
HOW
is this all done? The below tutorial is a demonstration of how to use Scrapy to crawl and scrape Craigslist for available rentals on the market. Note: If this tutorial is more than a few years old, the code may not work if the DOM structure of Criagslist has changed. This is built using Scrapy version 1. 3. 1 and Python 2. 7.
Objective
Scrape Craigslist data from the San Francisco apartment rental listings. I want to retrieve the following data for each listing:
The unique craigslist ID
The price
The home attributes such as bedrooms, bathrooms, and sqft.
The neighborhood
The latitude and longitude location of the listing
The date posted
Stream data into my PostgreSQL database directly from scraping
Initial Setup
I assume you have a virtualenv setup, and you know how to activate one of those. Do that now. If not the rest of the steps should work fine, but it’s highly advisable to use a virtualenv.
Install Scrapy.
$ pip install scrapy
Create your project and give it a name. This will create a folder for that project.
$ scrapy startproject insert-name-of-your-project
Change directory into your project folder.
$ cd name-of-your-project-you-created-in-step-3
Create your spider by giving it a name and a start URL.
$ scrapy genspider insert-name-of-your-spider insert-url-to-start-crawling-at
You should now have a directory folder that looks something like this:
├── project-name
│ └── project-name
│ ├──
│ └── spiders
│ └──
└──
Configure Your Spider
Go to your spiders folder and open This is where you’ll do the bulk of your crawling and scraping. The spider should define the initial request (site) to make, (optionally) how to follow the links from the initial request, and how to extract page content.
Here’s the breakdown of what your code should look like and why:
# You’ll need the below modules to create your spider:
from scrapy. spiders import CrawlSpider, Spider, Request, Rule
from nkextractors import LinkExtractor
# Replace “SpiderNameItem” with the class name in your
from import SpiderNameItem
class YourSpiderName(CrawlSpider): # Make sure you’re inheriting from the CrawlSpider class.
name = ‘spidername’ # This should already be filled with your spider’s name.
# Tells your spider to start crawling from this URL.
start_urls = [”]
rules = (
# This rule extracts all links from the start URL page using the XPATH selector.
# In this case it is only looking to extract links that are individual rental
# posting links and not other links to ad sites or elsewhere on Craigslist.
Rule(LinkExtractor(allow=(), restrict_xpaths=(‘//a[@class=”result-title hdrlnk”]’)),
follow=True, callback=’parse_item’),
# This rule says to follow/do a GET request on the extracted link,
# returning the response to the callback function,
# which is the parse_tem() function below.
Rule(LinkExtractor(allow=(), restrict_xpaths=(‘//a[contains(@class, “button next”)]’)),
follow=True, callback=’parse_item’), )
def parse_item(self, response):
“”” Using XPATH, parses data from the HTML responses and returns a dictionary-like item. “””
# Specifies what an Item is; in this case it is referring to the class name
# and instantiating a dictionary for a single item (rental).
item = SpiderNameItem()
# In my item dictionary, insert the following values for each rental.
# The below code uses XPATH selectors to select the specific data of interest
# from the response of each rental link I crawl to.
item[‘cl_id’] = (‘//div/p[@class=”postinginfo”]/text()’). extract()
item[‘price’] = (‘//span/span[contains(@class, “price”)]/text()’). extract()
item[‘attributes’] = (‘//p[contains(@class, “attrgroup”)]/span/b/text()’). extract()
item[‘neighborhood’] = (‘//span/small/text()’). extract()
item[‘date’] = (‘//time[contains(@class, “timeago”)]/@datetime’). extract_first()
item[‘location’] = (‘//a[contains(@href, “)]/@href’). extract()
# Finally, once a single rental page and data has been scraped,
# the crawler yields an iterator item that can later be called.
yield item
Configure Your
Go to your file.
This file uses a dictionary-like API defined as a class to create a dictionary of the data that you collect for each item collected. In this case, each item is every unique rental listing I scrape. For each item, dictionary keys need to be created using ().
Hint: You might want your keys labeled to be similar to your database columns. Here’s what it might look like:
import scrapy
class SpiderNameItem():
# Dictionary keys created for scraping in
cl_id = ()
price = ()
attributes = ()
neighborhood = ()
date = ()
location = ()
# Additional dictionary keys created upon data cleanse in
# These were added fields because I created them later in the pipelines file,
# and they were not collected in the initial scrape using my spider.
# For example, the rental attributes (above) actually contain data on bedrooms,
# bathrooms, and sqft, but it was easier for me to scrape all that data in one chunk
# and separate the code used clean up that data into my file.
bedrooms = ()
bathrooms = ()
sqft = ()
latitude = ()
longitude = ()
latlng = ()
Hint: If you’re building your scraper for the first time it’s helpful to bypass the next steps and check to see if your spider setup is working. To do this, run: $ scrapy crawl insert-name-of-your-spider -o This will create a dump of your item(s) to a JSON file so you can troubleshoot and inspect your data. (In place of you can also use,, or)
Configure Your (If Needed)
is pretty damn awesome. You can use this file to cleanse or validate your data, check for duplicates, and write your data into a database/external file (like JSON or JSON Lines). It’s what I call the place to put all that extra post-crawling/scraping code. You can also get creative and make multiple pipelines for different spiders, etc. Below I’m streaming my data into my PostgreSQL database using SQLAlchemy. Here’s an example of what your code might look like and why:
from scrapy. exceptions import DropItem # Module to handle (drop) bad items
# The below are optional – import all the libraries you need here
import re
import urllib
import geocoder
# This should be customized to import from your model
from model import connect_to_db_scrapy, Rental, UnitDetails
class RentScraperPipeline(object):
“”” Scrubs scraped data of whitespaces and interesting characters
and returns it as a data maintenance-friendly dictionary. “””
def process_item(self, item, spider):
# Gets Craigslist posting ID
if item[‘cl_id’]:
item[‘cl_id’] = (‘\s’, ”(item[‘cl_id’]))[2]
# Finds rental price; if none supplied or if it’s listed at or below $50, drop item
if item[‘price’]:
price = int((‘[^\d. ]+’, ”, ”(item[‘price’])))
if price > 50:
item[‘price’] = price
else:
raise DropItem(‘Missing price in%s’% item)
# Finds bedrooms, bathrooms, and sqft, if provided
if item[‘attributes’]:
attrs = item[‘attributes’]
for i, attribute in enumerate(attrs):
if “BR” in attrs[i]:
item[‘bedrooms’] = float(”(ndall(‘\d+\. *\d*’, attrs[i])))
if “Ba” in attrs[i]:
item[‘bathrooms’] = float(”(ndall(‘\d+\. *\d*’, attrs[i])))
if “BR” not in attrs[i] and “Ba” not in attrs[i]:
item[‘sqft’] = int(attrs[i])
# Get neighborhood name, if provided
if item[‘neighborhood’]:
item[‘neighborhood’] = (‘[^\s\w/\. ]+’, ”, ”(item[‘neighborhood’]))()()
# Get posting date in UTC isoformat time
if item[‘date’]:
item[‘date’] = ”(item[‘date’])
# Find Google maps web address; if none exists drop record
if item[‘location’]:
location_url = urllib. unquote(”(item[‘location’]))(‘utf8’)
find_location = (‘loc:’)
# If location on map is found convert to geolocation; if none exists drop record
if len(find_location) > 1:
location = find_location[1]
geo_location = (location)
# If a geocoded location exists get latitude and longitude, otherwise drop record
if geo_location:
latitude =
longitude =
item[‘latitude’] = latitude
item[‘longitude’] = longitude
item[‘latlng’] = ‘POINT(%s%s)’% (latitude, longitude)
raise DropItem(‘Missing geolocation in%s’% item)
raise DropItem(‘Missing google maps location in%s’% item)
raise DropItem(‘Missing google maps web address in%s’% item)
return item
class PostgresqlPipeline(object):
“”” Writes data to PostgreSQL database. “””
def __init__(self):
“”” Initializes database connection. “””
ssion = connect_to_db_scrapy()
“”” Method used to write data to database directly from the scraper pipeline. “””
cl_id = (‘cl_id’)
price = (‘price’)
date = (‘date’)
neighborhood = (‘neighborhood’)
bedrooms = (‘bedrooms’)
bathrooms = (‘bathrooms’)
sqft = (‘sqft’)
latitude = (‘latitude’)
longitude = (‘longitude’)
latlng = (‘latlng’)
try:
# Create rental details for unit
rental_details = UnitDetails(
neighborhood=neighborhood,
bedrooms=bedrooms,
bathrooms=bathrooms,
sqft=sqft,
latitude=latitude,
longitude=longitude,
latlng=latlng)
# Add rental details to UnitDetails table
(rental_details)
# Create rental unit
rental = Rental(
cl_id=cl_id,
price=price,
date_posted=date,
unitdetails=rental_details)
# Add rental unit to Rental table
(rental)
()
except:
# If the unit already exists in db or if data does not fit db model construct
llback()
raise
finally:
# Close the database session at the end of all tries
Phew! That’s a lot of code! We’re getting close. One last step…
Connecting All The Pieces Together
Check out your folder.
I also really love the settings file. It will be your best friend. There’s a lot of code that is commented out and embedded here is a lot of functionality you can play with, but what you have to configure is up to you. The following highlights what is necessary for this setup.
Uncomment ITEM_PIPELINES if you’re using List out the path and class name of the pipelines using dot notation, and put a number next to it. Conventionally, you use numbers in the hundreds, and each number represents the numerical order that Scrapy will complete each task. So in the below example, my scraper is set to 100 so it will complete before it tries to write to the database (300).
ITEM_PIPELINES = {
‘ntScraperPipeline’: 100,
# I commented this out, but this is an example of what you could do if you had a third pipeline.
# ‘rent_scraper. pipelines. JsonWriterPipeline’: 200,
‘gresqlPipeline’: 300}
Oh, But One More Thing
In your file, some very important features to pay attention to are AUTOTHROTTLE_* and CLOSESPIDER_*.
AUTOTHROTTLE can be used to delay the speed at which the pages are scraped for data.
CLOSESPIDER is useful if you want to automatically shut down the spider after you obtain a certain amount of data or have made a certain number of web requests. This is not included in the default build, but you can add one or both of the following example lines of code to do so:
# Close Spider after number of items scraped
CLOSESPIDER_ITEMCOUNT = 5
# Close spider after number of crawled page responses have been requested
CLOSESPIDER_PAGECOUNT = 10
Why should you care? Well, most companies don’t enjoy being bombarded by webcrawler requests and in fact many will ban you if they find out you’re using up all their bandwidth. Some companies even get a little touchy-feely when it comes to (mis)using their data, so the best thing to do is to be respectful about it. (Note: This is intended to be an educational guide, and I am not responsible for your actions. ) Scrape responsibily – with great power comes great responsibility!
And Lastly… Run That Sucker! (or crawl, if it’s a blood-sucking spider? )
Final step – run your spider using the following command:
$ scrapy crawl insert-name-of-your-spider
YESSSS!!!!!
That’s all folks! Enjoy:)
All of Craigslist

All of Craigslist

enter your words below to search
ADVICE #1: Make sure that the search terms you are considering are highly relevant to your ultimate goal.
Our services are quick and easy. You can use the search bar to find what you want nationwide, but also browse our articles, which might help you search, sell, or learn more.
Million ads posted monthly
Searching for the items you need on Craigslist can be frustrating when the platform only shows you what’s available in your area. With, you can search all Craigslist results from anywhere in the world on any device.
click if you want to find out some history about craig’s list
A few ten minutes are enough for the new announcements to be available throughout the country. Don’t miss a thing with 1 click.
Tips for research
1) To find a specific item, like a blue bmw convertible, you just need to enter these words in the search bar, like this – “blue bmw convertible”.
2) To add words to a search, simply enter them afterwards.
3) If you want to exclude words from a search, you must use the minus sign, for example: Bmw -bleue. The search will then focus on all bmw except blue. You can also use multiple minus signs in your searches.

ProxyBoys