Craigslist Data Mining
Anybody know of a Craigslist data mining tool? – Reddit
I thought there was a resource out there that compiled stuff like how many posts there were for, say, something like cars for sale, granular down to city, neighborhood, price range, etc., and kept historically, of course. There’s plenty out there for tracking individual postings, but I can’t find any large Craigslist datasets. Quick backstory: a while back, a guy in AskReddit responded to a question about how to make some quick money. He had a bunch of ideas, but one of them was how he bought low-priced dining room tables year-round off Craigslist and then, during the couple of weeks leading up to Thanksgiving and Christmas (holidays that call for dining room tables for big meals), sold his stash for a decent profit. That’s just something that never occurred to me. I think it would be very interesting to find some other Craigslist trends.
Web Scraping Craigslist: A Complete Tutorial | by Riley Predum
I’ve been looking to make a move recently. And what better way to know I’m getting a good price than to sample from the “population” of housing on Craigslist? Sounds like a job for… Python and web scraping! In this article, I’m going to walk you through my code that scrapes East Bay Area Craigslist for apartments. The code, or rather the URL parameters, can be modified to pull from any region, category, property type, etc. Pretty cool, huh? I’m going to share GitHub gists of each cell in the original Jupyter Notebook. If you’d like to see the whole code at once, clone the repo. Otherwise, enjoy the read and follow along!
Getting the Data
First things first, I needed to use the get function from the requests package. I defined a variable, response, and assigned it the result of calling get on the base URL. By base URL, I mean the URL of the first page you want to pull data from, minus any extra arguments. I went to the apartments section for the East Bay and checked the “Has Picture” filter to narrow down the search a little, so it’s not a true base URL. I then imported BeautifulSoup from bs4, the module that can actually parse the HTML of the web page retrieved from the server. Using the find_all method on the newly created html_soup variable, I found the posts, then checked the type and length of the result to make sure it matches the number of posts on the page; it prints out the length of posts, which is 120. To find the posts in the first place, I needed to examine the website’s structure for their parent tag.
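The flow described above can be sketched in a few lines. This is a self-contained illustration, not the tutorial's exact code: it parses an inline HTML snippet instead of making a live requests.get call, and the result-row class name reflects the Craigslist markup of the tutorial's era, so treat it as an assumption.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that response = requests.get(base_url) would return;
# each Craigslist search result was an <li class="result-row"> element.
html = """
<ul>
  <li class="result-row">Listing one</li>
  <li class="result-row">Listing two</li>
</ul>
"""

html_soup = BeautifulSoup(html, "html.parser")
posts = html_soup.find_all("li", class_="result-row")
print(len(posts))  # 2 here; 120 on a full Craigslist results page
```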
How I Built a Python Bot to Help Me Find an Apartment in San Francisco
I moved from Boston to the Bay Area a few months ago. Priya (my girlfriend) and I heard all sorts of horror stories about the rental market. The fact that searching for “How to find an apartment in San Francisco” on Google yields dozens of pages of advice is a good indicator that apartment hunting is a painful process.
Boston is cold, but finding an apartment in SF is scary. We read that landlords hold open houses, and that you have to bring all of your paperwork to the open house and be willing to put down a deposit immediately to even be considered.
We started exhaustively researching the process and figured out that a lot of finding an apartment comes down to timing. Some landlords want to hold an open house no matter what, but for others, being one of the first people to see the apartment usually means you can get it. You need to find the listing, quickly figure out if it meets your criteria, then call the landlord to arrange a showing to have a shot.
We looked around at some of the apartment rental sites recommended by internet posters, like Padmapper and LiveLovely, but none of them gave us a feed of real-time listings that we could look at and rate together. None of them gave us the ability to specify additional criteria, like very specific neighborhoods only, or proximity to transportation. As most apartment listings in the Bay Area are originally on Craigslist, then scraped by the other sites, there’s also a fear that maybe not all the listings are scraped, or that they’re not scraped quickly enough to make the alerts real-time. We wanted a way to:
Get notified in near real-time when a posting was made to Craigslist.
Filter out listings that didn’t fall into our desired neighborhoods.
Filter out listings that didn’t match additional criteria, like proximity to public transit.
Collaborate on listings and rate them together.
Easily get in touch with the landlord for listings we liked.
After thinking about it, I realized we could solve the problem with a four-step process:
Scrape listings from Craigslist.
Filter out listings that don’t match our criteria.
Post the listings to Slack, a team chat tool, so we can discuss and rate them.
Wrap the whole process into a persistent loop and deploy it to a server (so it would run continuously).
In the rest of this post, we’ll walk through how each piece was built, and how the final Slack bot was used to help us find an apartment. Using this bot, Priya and I found a reasonably priced (for SF!) one bedroom that we love in about a week, far less time than we thought it would take.
If you want to look at the code as you go through this post, here’s a link to the finished project.
Step one — Scraping listings from Craigslist
The first step in building our bot is to get listings from Craigslist. Craigslist unfortunately doesn’t have an API, but we can get posts using the python-craigslist package. python-craigslist scrapes the content of the page, then uses BeautifulSoup to extract relevant pieces from the page and convert them to structured data.
The code for the package is fairly short, and worth a read-through. Craigslist apartment listings for San Francisco are located at https://sfbay.craigslist.org/search/sfc/apa. In the code below, we:
Import CraigslistHousing, a class in python-craigslist.
Initialize the class with the following arguments:
site — the Craigslist site that we want to scrape. The site is the first part of the URL, like https://sfbay.craigslist.org.
area — the subarea within the site that we want to scrape. The area appears in the URL after the site, like the sfc in https://sfbay.craigslist.org/search/sfc/apa, which will only look in San Francisco proper.
category — the type of listing we want to look for. The category is the last part of a search URL, like the apa in https://sfbay.craigslist.org/search/sfc/apa, which lists all the apartments.
filters — any filters we want to apply to the results.
max_price — the maximum price we’re willing to pay.
min_price — the minimum price we want to look for.
Get the results from Craigslist using the get_results method, which is a generator.
Pass the geotagged argument to attempt to add coordinates to each result.
Pass the limit argument to only get 20 results.
Pass the newest argument to only get the newest listings.
Get each result from the results generator and print it.
from craigslist import CraigslistHousing
cl = CraigslistHousing(site='sfbay', area='sfc', category='apa',
                       filters={'max_price': 2000, 'min_price': 1000})
results = cl.get_results(sort_by='newest', geotagged=True, limit=20)
for result in results:
    print(result)
We’ve finished the first step of the bot pretty quickly! We’re now able to scrape Craigslist and get listings. Each result is a dictionary with several fields:
{'datetime': '2016-07-20 16:39',
 'geotag': (37.783166, -122.418671),
 'has_image': True,
 'has_map': True,
 'id': '5692904929',
 'name': 'Be the first in line at Brendas restaurant! Quiet studio available',
 'price': '$1995',
 'url': '',
 'where': 'tenderloin'}
Here’s a description of the fields:
datetime — when the listing was posted.
geotag — the coordinate location of the listing.
has_image — whether there’s an image in the Craigslist posting.
has_map — whether there’s a map associated with the listing.
id — the unique Craigslist id for the listing.
name — the name of the listing that shows up on Craigslist.
price — the monthly rent.
url — the URL to view the full listing.
where — what the person who created the listing put in for where it is.
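Note that price comes back as a string like '$1995', so a small helper is handy for turning it into a number before doing any comparisons (parse_price is a hypothetical name for illustration, not part of python-craigslist):

```python
def parse_price(price_str):
    """Convert a Craigslist price string like '$1995' or '$1,995' to an int."""
    return int(price_str.replace("$", "").replace(",", ""))

result = {'price': '$1995', 'where': 'tenderloin'}
print(parse_price(result['price']))  # 1995
```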
Step two — Filtering the results
Now that we have a way to get listings from Craigslist, we just need a way to filter them and only see the ones we like.
Filtering the results by area
When Priya and I were searching for apartments, we wanted to look at places in a few areas, including:
San Francisco
Sunset
Pacific Heights
Lower Pacific Heights
Bernal Heights
Richmond
Berkeley
Oakland
Adams Point
Lake Merritt
Rockridge
Alameda
In order to filter by neighborhood, we’ll first need to define bounding boxes around the areas:
Drawing a box around Lower Pacific Heights
The bounding box above was created using BoundingBox. Be sure to specify the CSV option in the bottom left to get the coordinates of the box. You can also define a bounding box yourself by finding the coordinates for the bottom left and the top right using a tool like Google Maps. After finding the boxes, we’ll create a dictionary of neighborhoods and coordinates:
BOXES = {
    "adams_point": [
        [37.80789, -122.25000],
        [37.81589, -122.26081],
    ],
    "piedmont": [
        [37.82240, -122.24768],
        [37.83237, -122.25386],
    ],
    ...
}
Each dictionary key is a neighborhood name, and each key contains a list of lists. The first inner list is the coordinates of the bottom left of the box, and the second is the coordinates of the top right of the box. We can then perform the filtering by checking to see if the coordinates for a listing are inside any of the boxes. The below code will:
Loop through each of the keys in BOXES.
Check to see if the result is inside the box.
Set the appropriate variables if so.
def in_box(coords, box):
    if box[0][0] < coords[0] < box[1][0] and box[1][1] < coords[1] < box[0][1]:
        return True
    return False

geotag = result["geotag"]
area_found = False
area = ""
for a, coords in BOXES.items():
    if in_box(geotag, coords):
        area = a
        area_found = True
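As a quick sanity check of the bounding-box logic, we can run in_box against points that should and shouldn't fall inside the adams_point box defined earlier (the function is re-stated here so the snippet runs on its own):

```python
def in_box(coords, box):
    # box[0] is the bottom-left corner, box[1] the top-right; note that for
    # these Bay Area boxes the second corner has the more negative longitude.
    if box[0][0] < coords[0] < box[1][0] and box[1][1] < coords[1] < box[0][1]:
        return True
    return False

adams_point = [[37.80789, -122.25000], [37.81589, -122.26081]]

print(in_box((37.81000, -122.25500), adams_point))  # True: inside the box
print(in_box((37.90000, -122.25500), adams_point))  # False: too far north
```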
Unfortunately, not all results from Craigslist will have coordinates associated with them. It’s up to the person who posts the listing to specify a location, which coordinates can be calculated from. The more familiar the person posting the listing is with Craigslist, the more likely they are to include a location.
Listings posted by agents, who are more likely to charge high rent, usually have associated locations. Postings by owners are less likely to have coordinates, but are also usually better deals. Thus, it makes sense to have a failsafe to figure out whether listings without coordinates are in the neighborhoods we want.
We’ll create a list of neighborhoods, then do string matching to see if the listing falls into one of them. This is less accurate than using coordinates, because many listings misreport their neighborhood, but it’s better than nothing:
NEIGHBORHOODS = ["berkeley north", "berkeley", "rockridge", "adams point", ...]
To do name-based matching, we can loop through each of the NEIGHBORHOODS:
location = result["where"]
for hood in NEIGHBORHOODS:
    if hood in location.lower():
        area = hood
Once a result has been processed by the two parts of code we’ve written so far, we’ll have removed any listings that aren’t in the neighborhoods we want to live in. We’ll have a few false positives, and we may miss some listings that don’t have a neighborhood or location specified, but this system catches the vast majority of listings.
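Wrapped in a function, the name matching looks like this (match_neighborhood is a hypothetical helper name for illustration; the project loops inline instead):

```python
NEIGHBORHOODS = ["berkeley north", "berkeley", "rockridge", "adams point"]

def match_neighborhood(where):
    """Return the first neighborhood whose name appears in the listing's
    free-text 'where' field, or None if nothing matches."""
    for hood in NEIGHBORHOODS:
        if hood in where.lower():
            return hood
    return None

print(match_neighborhood("Rockridge, Oakland"))  # rockridge
print(match_neighborhood("tenderloin"))          # None
```

One design caveat: because this is plain substring matching, a listing whose "where" field says "north berkeley" matches "berkeley" rather than "berkeley north", which is part of why this fallback is less accurate than the coordinate check.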
Filtering the results by proximity to transit
Priya and I knew we’d both be traveling to San Francisco a lot, so we wanted to live near public transit if we weren’t going to live in SF itself. In the Bay Area, the main form of public transit is called BART.
BART is a partially underground regional transit system that connects Oakland, Berkeley, San Francisco, and the surrounding areas. In order to build this functionality into our bot, we’ll first need to define a list of transit stations. We can get the coordinates of transit stops from Google Maps, then create a dictionary of them:
TRANSIT_STATIONS = {
    "oakland_19th_bart": [37.8118051, -122.2720873],
    "macarthur_bart": [37.8265657, -122.2686705],
    "rockridge_bart": [37.841286, -122.2566329],
    ...
}
Every key is the name of a transit station, and has an associated list. The list contains the latitude and longitude of the transit station. Once we have the dictionary, we find the closest transit station to each result. The below code will:
Loop through each key and item in TRANSIT_STATIONS.
Use the coord_distance function to find the distance in kilometers between two pairs of coordinates. You can find an explanation of this function here.
Check to see if the station is the closest to the listing.
If the station is too far (farther than 2 kilometers, or about 1.2 miles), it is ignored.
If the station is closer than the previous closest station, it’s used.
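The article links to an explanation of coord_distance rather than showing it; as a sketch, a haversine implementation along these lines (assuming a spherical Earth with a mean radius of 6371 km, which may differ slightly from the project's constants) behaves the way the loop below expects:

```python
import math

def coord_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points,
    computed with the haversine formula."""
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

# Oakland 19th St BART to MacArthur BART is roughly 1.7 km apart.
print(round(coord_distance(37.8118051, -122.2720873,
                           37.8265657, -122.2686705), 1))
```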
min_dist = None
near_bart = False
bart_dist = "N/A"
bart = ""
MAX_TRANSIT_DIST = 2  # kilometers

for station, coords in TRANSIT_STATIONS.items():
    dist = coord_distance(coords[0], coords[1], geotag[0], geotag[1])
    if (min_dist is None or dist < min_dist) and dist < MAX_TRANSIT_DIST:
        bart = station
        near_bart = True
    if min_dist is None or dist < min_dist:
        min_dist = dist
        bart_dist = dist
After this, we know the closest transit station to each listing.
Step three — Creating our Slack Bot
Setup
After we filter down our results, we’re ready to post what we have to Slack. If you’re unfamiliar with Slack, it’s a team chat application. You create a team in Slack, and can then invite members. Each Slack team can have multiple channels where members exchange messages. Each message can be annotated by other people in the channel, such as adding a thumbs up or other emoticons. Here’s more information on Slack.
If you want to get a feel for Slack, we run a data science Slack community that you might like to join. By posting our results to Slack, we’ll be able to collaborate with others and figure out which listings are the best. To do this, we’ll need to:
Create a Slack team, which we can do here.
Create a channel for the listings to be posted into. Here’s help on this. It’s suggested to use #housing as the name of the channel.
Get a Slack API token, which we can do here. Here’s more information on the process.
After these steps, we’re ready to create the code that posts the listings to Slack.
Coding it up
After getting the right channel name and token, we can post our results to Slack. To do this, we’ll use python-slackclient, a Python package that makes it easy to use the Slack API. python-slackclient is initialized using a Slack token, then gives us access to many API endpoints that manage the team and messages. The below code will:
Initialize a SlackClient using the SLACK_TOKEN.
Create a message string from the result containing all the information we need to see, such as price, the neighborhood the listing is in, and the URL.
Post the message to Slack with the username pybot, and a robot as an avatar.
from slackclient import SlackClient
SLACK_TOKEN = "ENTER_TOKEN_HERE"
SLACK_CHANNEL = "#housing"
sc = SlackClient(SLACK_TOKEN)
desc = "{0} | {1} | {2} | {3} | <{4}>".format(
    result["area"], result["price"], result["bart_dist"],
    result["name"], result["url"])
sc.api_call(
    "chat.postMessage", channel=SLACK_CHANNEL, text=desc,
    username='pybot', icon_emoji=':robot_face:')
Once everything is hooked up, the Slack bot will post listings into Slack that look like this:
How listings will look when the bot is running. Note how you can annotate listings with emoticons, like the thumbs up.
Step four — Operationalizing everything
Now that we have the basics nailed down, we’ll need to run the code persistently. After all, we want our results to be posted to Slack in real-time, or close to it. In order to operationalize everything, we’ll need to go through a few steps:
Store the listings in the database, so we don’t post duplicates into Slack.
Separate the settings, like SLACK_TOKEN, from the rest of the code to make them easy to adjust.
Create a loop that will run continuously, so we’re scraping results 24/7.
Storing listings
The first step is to use a Python package called SQLAlchemy to store our listings. SQLAlchemy is an Object Relational Mapper, or ORM, that makes it easier to work with databases from Python. Using SQLAlchemy, we can create a database table that will store listings, and a database connection to make it easy to add data to the table.
We’ll use SQLAlchemy in conjunction with the SQLite database engine, which will store all of our data in a single file called listings.db. The below code will:
Import SQLAlchemy.
Create a connection to the SQLite database that will be created in our current directory.
Define a table called Listing that contains all the relevant fields from a Craigslist listing.
The unique fields cl_id and link will prevent us from posting duplicate listings to Slack.
Create a database session from the connection, which will allow us to store listings.
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, DateTime, Float, Boolean
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///listings.db', echo=False)
Base = declarative_base()
class Listing(Base):
    """
    A table to store data on craigslist listings.
    """
    __tablename__ = 'listings'
    id = Column(Integer, primary_key=True)
    link = Column(String, unique=True)
    created = Column(DateTime)
    geotag = Column(String)
    lat = Column(Float)
    lon = Column(Float)
    name = Column(String)
    price = Column(Float)
    location = Column(String)
    cl_id = Column(Integer, unique=True)
    area = Column(String)
    bart_stop = Column(String)

Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
Now that we have our database model, we’ll just need to store each listing into the database, and we’ll be able to avoid duplicates.
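To see how the unique constraint keeps duplicates out in practice, here is a self-contained sketch. It uses an in-memory SQLite database and a trimmed-down table, and save_listing is a hypothetical helper for illustration, not the project's actual code: the second insert with the same cl_id raises IntegrityError, which we catch and roll back instead of posting to Slack again.

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Listing(Base):
    """A trimmed-down listings table, just enough to show deduplication."""
    __tablename__ = "listings"
    id = Column(Integer, primary_key=True)
    cl_id = Column(Integer, unique=True)
    name = Column(String)

engine = create_engine("sqlite://", echo=False)  # in-memory database
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def save_listing(session, cl_id, name):
    """Insert a listing; return False (and roll back) if cl_id already exists."""
    try:
        session.add(Listing(cl_id=cl_id, name=name))
        session.commit()
        return True
    except IntegrityError:
        session.rollback()
        return False

print(save_listing(session, 5692904929, "Quiet studio"))  # True: new listing
print(save_listing(session, 5692904929, "Quiet studio"))  # False: duplicate
```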
Separating configuration from code
The next step is to separate the configuration from the code. We’ll create a file called settings.py that stores our configuration. The configuration includes SLACK_TOKEN, which is a secret that we don’t want to accidentally commit to git and push to GitHub, as well as other settings like BOXES that aren’t secret, but that we want to be able to edit easily. We’ll move the following settings to settings.py:
MIN_PRICE — the minimum listing price we want to search for.
MAX_PRICE — the maximum listing price we want to search for.
CRAIGSLIST_SITE — the regional Craigslist site we want to search in.
AREAS — a list of areas of the regional Craigslist site that we want to search in.
BOXES — coordinate boxes of the neighborhoods we want to look in.
NEIGHBORHOODS — if the listing doesn’t have coordinates, a list of neighborhoods to match on.
MAX_TRANSIT_DIST — the farthest we want to be from a transit station.
TRANSIT_STATIONS — the coordinates of transit stations.
CRAIGSLIST_HOUSING_SECTION — the subsection of Craigslist housing that we want to look in.
SLACK_CHANNEL — the Slack channel we want the bot to post in.
We’ll also want to create a separate file that is ignored by git and contains the following key:
SLACK_TOKEN — the token to post to our Slack team.
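One common way to wire this up (a sketch; the project's actual layout may differ, and the private.py file name is an assumption used for illustration) is for settings.py to define defaults and then pull in overrides from the git-ignored file at the end:

```python
# settings.py -- defaults live here; anything defined in a git-ignored
# private.py (hypothetical name) overrides them when present.

MIN_PRICE = 1000
MAX_PRICE = 2000
SLACK_CHANNEL = "#housing"
SLEEP_INTERVAL = 20 * 60  # seconds between scrape cycles

try:
    from private import *  # noqa: F401,F403 -- secret overrides, never committed
except ImportError:
    pass  # no private settings file; fall back to the defaults above
```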
You can see the finished file here.
Create a loop
Finally, we’ll need to create a loop that runs our scraping code continuously. The below code will:
When called from the command line:
Print a status message containing the current time.
Run the craigslist scraping code by calling the do_scrape function.
Quit if the user types Ctrl + C.
Handle other exceptions by printing the traceback and continuing.
If no exceptions, print a success message (corresponds to the else clause below).
Sleeping for a defined interval before scraping again. By default, this is set to 20 minutes.
from scraper import do_scrape
import settings
import time
import sys
import traceback
if __name__ == "__main__":
    while True:
        print("{}: Starting scrape cycle".format(time.ctime()))
        try:
            do_scrape()
        except KeyboardInterrupt:
            print("Exiting....")
            sys.exit(1)
        except Exception:
            print("Error with the scraping:", sys.exc_info()[0])
            traceback.print_exc()
        else:
            print("{}: Successfully finished scraping".format(time.ctime()))
        time.sleep(settings.SLEEP_INTERVAL)
We’ll also need to add SLEEP_INTERVAL to settings.py in order to control how often the scraping happens. By default, this is set to 20 minutes.
Running it yourself
Now that the code is wrapped up, let’s look into how you can run the Slack bot yourself.
Running on your local computer
You can find the project on GitHub here. In the README, you’ll find detailed installation instructions. Unless you’re experienced installing programs and are running Linux, it’s suggested to follow the Docker instructions.
Docker is a tool that makes it easy to create and deploy applications, and makes it very fast to get started with this Slack bot on your local machine. Here are basic instructions for installing and running the Slack bot with Docker:
Create a folder called config, then put your own settings file inside it.
Any settings you specify in that file will override the defaults in settings.py.
By adding settings to the file, you can customize the behavior of the bot.
Specify new values there for any of the settings above.
For example, you could put AREAS = ['sfc'] in the file to only look in San Francisco.
If you want to post into a Slack channel not called #housing, add an entry for SLACK_CHANNEL.
If you don’t want to look in the Bay Area, you’ll need to update the following settings at the minimum:
CRAIGSLIST_SITE
AREAS
BOXES
NEIGHBORHOODS
TRANSIT_STATIONS
CRAIGSLIST_HOUSING_SECTION
MIN_PRICE
MAX_PRICE
Install Docker by following these instructions.
To run the bot with the default configuration:
docker run -d -e SLACK_TOKEN={YOUR_SLACK_TOKEN} dataquestio/apartment-finder
To run the bot with your own configuration:
docker run -d -e SLACK_TOKEN={YOUR_SLACK_TOKEN} -v {ABSOLUTE_PATH_TO_YOUR_CONFIG_FOLDER}:/opt/wwc/apartment-finder/config dataquestio/apartment-finder
Deploying the bot
Unless you want to leave your computer on 24/7, it makes sense to deploy the bot to a server so it can run continuously. We can create a server on a hosting provider called DigitalOcean, which can automatically create a server with Docker installed. Here’s a guide on how to get started with Docker on DigitalOcean.
If you don’t know what the author means by “shell”, here’s a tutorial on how to SSH into a DigitalOcean droplet. If you don’t want to follow a guide, you can also get started here. After creating a server on DigitalOcean, you can ssh into the server, then follow the Docker installation and usage instructions above.
Next steps
After following the steps above, you should have a Slack bot that finds apartments for you automatically. Using this bot, Priya and I found a great apartment in San Francisco for more than we hoped to pay, but less than we thought a one bedroom in SF would end up costing. It also took us a lot less time than we’d expected it to. Even though it worked for us, there are quite a few extensions that could be made to improve the bot:
Taking thumbs up and thumbs down from Slack, and training a machine learning model.
Automatically pulling the locations of transit stops from an API.
Adding in points of interest like parks and other items.
Adding in the walkscore or other neighborhood quality scores, like crime.
Automatically extracting landlord phone numbers and emails.
Automatically calling landlords and scheduling showings (if someone does this, you’re awesome).
Feel free to submit pull requests to the project on Github, and please let me know if this tool is helpful for you. Looking forward to seeing how you use it! When I’m not building slackbots to help me find apartments, I’m working on my startup Dataquest, the best online platform for learning Python and Data Science. If that interests you, you can sign up and complete our basic courses for free.