• December 22, 2024

Reddit Crawler Python

How to scrape Reddit with Python – Storybench

Last month, Storybench editor Aleszu Bajak and I decided to explore user data on nootropics, the brain-boosting pills that have become popular for their productivity-enhancing properties. Many of the substances are also banned at the Olympics, which is why we were able to pitch and publish the piece at Smithsonian magazine during the 2018 Winter Olympics. For the story and visualization, we decided to scrape Reddit to better understand the chatter surrounding drugs like modafinil, noopept and piracetam.
In this Python tutorial, I will walk you through how to access the Reddit API to download data for your own project.
This is what you will need to get started:
Python 3.x: I recommend you use the Anaconda distribution for its simple package management. You can also download Python from the project’s website. When following the script, pay special attention to indentation, which is a vital part of Python.
An IDE (Integrated Development Environment) or a text editor: I personally use Jupyter Notebooks for projects like this (and it is already included in the Anaconda pack), but use what you are most comfortable with. You can also run scripts from the command line.
These two Python packages installed: PRAW, to connect to the Reddit API, and Pandas, which we will use to handle, format, and export data.
A Reddit account. You can create it here.
The Reddit API
The very first thing you’ll need to do is “Create an App” within Reddit to get the OAuth2 keys to access the API. It is easier than you think.
Go to this page and click the “create app” or “create another app” button at the bottom left.
This form will open up.
Pick a name for your application and add a description for reference. Also make sure you select the “script” option and don’t forget to put localhost:8080 in the redirect uri field. If you have any doubts, refer to Praw documentation.
Hit create app and now you are ready to use the OAuth2 authorization to connect to the API and start scraping. Copy and paste your 14-character personal use script and 27-character secret key somewhere safe. Your application should look like this:
The “shebang line” and importing packages and modules
We will be using only one of Python’s built-in modules, datetime, and two third-party modules, Pandas and Praw. The best practice is to put your imports at the top of the script, right after the shebang line, which starts with #!. It should look like:
#!/usr/bin/env python3
import praw
import pandas as pd
import datetime as dt
The “shebang line” is what you see on the very first line of the script: #!/usr/bin/env python3. You only need to worry about this if you are considering running the script from the command line. The shebang line is just some code that helps the computer locate the Python interpreter. It varies a little bit from Windows to Macs to Linux, so replace the first line accordingly:
On Windows, the shebang line is #! python3.
On Linux, the shebang line is #! /usr/bin/python3.
Getting Reddit and subreddit instances
PRAW stands for Python Reddit API Wrapper, and it makes it very easy for us to access Reddit data. First we connect to Reddit by calling the praw.Reddit function and storing the result in a variable. I’m calling mine reddit. You should pass the following arguments to that function:
reddit = praw.Reddit(client_id='PERSONAL_USE_SCRIPT_14_CHARS',
                     client_secret='SECRET_KEY_27_CHARS',
                     user_agent='YOUR_APP_NAME',
                     username='YOUR_REDDIT_USER_NAME',
                     password='YOUR_REDDIT_LOGIN_PASSWORD')
From that, we use the same logic to get to the subreddit we want: call the .subreddit method on reddit and pass it the name of the subreddit we want to access. It can be found after “r/” in the subreddit’s URL. I’m going to use r/Nootropics, one of the subreddits we used in the story.
Also, remember to assign that to a new variable like this:
subreddit = reddit.subreddit('Nootropics')
Accessing the threads
Each subreddit has five different ways of organizing the topics created by redditors: .hot, .new, .controversial, .top, and .gilded. You can also use .search("SEARCH_KEYWORDS") to get only results matching an engine search.
Let’s just grab the most up-voted topics of all time with:
top_subreddit = subreddit.top()
That will return a list-like object with the top 100 submissions in r/Nootropics. You can control the size of the sample by passing a limit to .top(), but be aware that Reddit’s request limit* is 1000, like this:
top_subreddit = subreddit.top(limit=500)
*PRAW had a fairly easy work-around for this by querying the subreddits by date, but the endpoint that allowed it is soon to be deprecated by Reddit. We will try to update this tutorial as soon as PRAW’s next update is released.
There is also a way of requesting a refresh token, for more advanced Python developers.
Parsing and downloading the data
We are now very close to getting the data in our hands. Our top_subreddit object has methods to return all kinds of information from each submission. You can check it for yourself with these two simple lines:
for submission in subreddit.top(limit=1):
    print(submission.title, submission.id)
For the project, Aleszu and I decided to scrape this information about the topics: title, score, url, id, number of comments, date of creation, and body text. This can be done very easily with a for loop just like the one above, but first we need to create a place to store the data. In Python, that is usually done with a dictionary. Let’s create it with the following code:
topics_dict = {"title": [],
               "score": [],
               "id": [],
               "url": [],
               "comms_num": [],
               "created": [],
               "body": []}
Now we are ready to start scraping the data from the Reddit API. We will iterate through our top_subreddit object and append the information to our dictionary.
for submission in top_subreddit:
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)
Python dictionaries, however, are not very easy for us humans to read. This is where the Pandas module comes in handy. We’ll finally use it to put the data into something that looks like a spreadsheet — in Pandas, we call those Data Frames.
topics_data = pd.DataFrame(topics_dict)
The data now looks like this:
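The screenshot of the resulting table did not survive in this copy; a minimal sketch with invented rows (the titles, ids and scores below are made up, not real r/Nootropics data) shows the shape of the resulting DataFrame:

```python
import pandas as pd

# Toy data standing in for the scraped submissions, to show the shape of
# topics_data. All values are invented for illustration.
topics_dict = {
    "title": ["Beginner stack advice", "Modafinil experiences"],
    "score": [42, 128],
    "id": ["abc123", "def456"],
    "url": ["https://example.com/1", "https://example.com/2"],
    "comms_num": [10, 57],
    "created": [1518307200.0, 1518393600.0],
    "body": ["", "Long-form self text..."],
}
topics_data = pd.DataFrame(topics_dict)
print(topics_data.shape)  # (2, 7): one row per submission, one column per field
```

Each key of the dictionary becomes a column, and each appended submission becomes a row.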
Fixing the date column
Reddit uses UNIX timestamps to format date and time. Instead of manually converting all those entries, or using an online converter, we can easily write a function in Python to automate that process. We define it, call it, and join the new column to the dataset with the following code:
def get_date(created):
    return dt.datetime.fromtimestamp(created)

_timestamp = topics_data["created"].apply(get_date)
topics_data = topics_data.assign(timestamp=_timestamp)
The dataset now has a new column that we can understand and is ready to be exported.
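As a side note, Pandas can do the same conversion without a helper function. This is an alternative sketch, not the approach used in the story; the timestamp below is just an example value:

```python
import pandas as pd

# pd.to_datetime converts UNIX timestamps directly when given unit="s".
# The value below is an example timestamp (2018-02-11 00:00:00 UTC).
topics_data = pd.DataFrame({"created": [1518307200.0]})
topics_data = topics_data.assign(
    timestamp=pd.to_datetime(topics_data["created"], unit="s")
)
print(topics_data["timestamp"].iloc[0])
```

The result is the same kind of readable datetime column, computed in one vectorized call instead of an .apply loop.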
Exporting a CSV
Pandas makes it very easy for us to create data files in various formats, including CSVs and Excel workbooks. To finish up the script, add the following to the end.
topics_data.to_csv('FILENAME.csv', index=False)
That is it. You scraped a subreddit for the first time. Now, let’s go run that cool data analysis and write that story.
If you have any questions, ideas, thoughts, contributions, you can reach me at @fsorodrigues or fsorodrigues [at] gmail [dot] com.
Felippe is a former law student turned sports writer and a big fan of the Olympics. He is currently a graduate student in Northeastern’s Media Innovation program.
Scraping Reddit data

How to scrape data from Reddit using the Python Reddit API Wrapper (PRAW). As its name suggests, PRAW is a Python wrapper for the Reddit API, which enables you to scrape data from subreddits, create a bot and much more. In this article, we will learn how to use PRAW to scrape posts from different subreddits as well as how to get comments from a specific post. PRAW can be installed using pip or conda:
pip install praw
Now PRAW can be imported by writing:
import praw
Before it can be used to scrape data, we need to authenticate ourselves. For this we need to create a Reddit instance and provide it with a client_id, client_secret and a user_agent. To get the authentication information we need to create a Reddit app by navigating to this page and clicking create app or create another app.
Figure 1: Reddit Application
This will open a form where you need to fill in a name, description and redirect uri. For the redirect uri you should choose localhost:8080, as described in the excellent PRAW documentation.
Figure 2: Create new Reddit Application
After pressing create app, a new application will appear. Here you can find the authentication information needed to create the Reddit instance.
Figure 3: Authentication information
Now that we have a Reddit instance, we can access all available functions and use it to, for example, get the 10 “hottest” posts from the Machine Learning subreddit:
[D] What is the best ML paper you read in 2018 and why?
[D] Machine Learning – WAYR (What Are You Reading) – Week 53
[R] A Geometric Theory of Higher-Order Automatic Differentiation
UC Berkeley and Berkeley AI Research published all materials of CS 188: Introduction to Artificial Intelligence, Fall 2018
[Research] Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks
We can also get the 10 “hottest” posts of all subreddits combined by specifying “all” as the subreddit name:
I’ve been lying to my wife about film plots for years.
I don’t care if this gets downvoted into oblivion! I DID IT REDDIT!!
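The snippet that produced the listing above is not reproduced in this copy. Below is a minimal sketch of the call shape, using FakeSubreddit and FakeSubmission stand-ins of our own invention so it runs without credentials; with real PRAW you would pass reddit.subreddit("MachineLearning") instead:

```python
# FakeSubreddit/FakeSubmission are stand-ins for praw.models objects, so this
# example runs offline. With real PRAW, .hot() yields Submission objects.
class FakeSubmission:
    def __init__(self, title):
        self.title = title

class FakeSubreddit:
    def hot(self, limit=None):
        # pretend the subreddit currently has 100 "hot" posts
        return [FakeSubmission(f"post {i}") for i in range(100)][:limit]

def hottest_titles(subreddit, limit=10):
    # real usage (assumption, requires credentials):
    #   hottest_titles(reddit.subreddit("MachineLearning"))
    return [submission.title for submission in subreddit.hot(limit=limit)]

print(len(hottest_titles(FakeSubreddit())))  # 10
```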
I’ve had enough of your shit, Karen
Stranger Things 3: Coming July 4th, …
The variable can be iterated over, and features including the post title, id and url can be extracted and saved.
Figure 4: Hottest ML posts
General information about the subreddit can be obtained using the .description attribute on the subreddit object:
**[Rules For Posts]()** | [Research]() | [Discussion]() | [Project]() | [News]()
We can get the comments for a post/submission by creating/obtaining a Submission object and looping through the comments attribute. To get a post/submission, we can either iterate through the submissions of a subreddit or specify a specific submission using reddit.submission and passing it the submission url or id. To get the top-level comments, we only need to iterate over submission.comments. This will work for some submissions, but for others that have more comments, this code will throw an AttributeError saying:
AttributeError: ‘MoreComments’ object has no attribute ‘body’
These MoreComments objects represent the “load more comments” and “continue this thread” links encountered on the website, as described in more detail in the comment documentation. To get rid of the MoreComments objects, we can check the datatype of each comment before printing the body. Alternatively, PRAW already provides a method called replace_more, which replaces or removes the MoreComments. The method takes an argument called limit, which when set to 0 will remove all MoreComments. Both approaches successfully iterate over all the top-level comments and print their body. The output looks like:
I thought this was a shit post made in paint before I read the title
Wow, that’s very cool. To think how keen their senses must be to recognize and avoid each other and their territories. Plus, I like to think that there’s one from the white colored clan who just goes way into the other territories because, well, he’s a …
That’s really cool. The edges are surprisingly …
However, the comment section can be arbitrarily deep, and most of the time we surely also want to get the comments of the comments.
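The code blocks referred to above were likewise lost in this copy. Here is a minimal sketch of the datatype check, using stand-in classes of our own; a live praw.Reddit instance would supply real Comment and MoreComments objects:

```python
# Stand-ins for praw.models.Comment and praw.models.MoreComments, so the
# datatype check described above can be demonstrated without credentials.
class Comment:
    def __init__(self, body):
        self.body = body

class MoreComments:
    # represents a "load more comments" link; it has no .body attribute
    pass

def top_level_bodies(comments):
    # skip MoreComments placeholders instead of hitting an AttributeError
    return [c.body for c in comments if not isinstance(c, MoreComments)]

print(top_level_bodies([Comment("first"), MoreComments(), Comment("second")]))
# ['first', 'second']
```

With real PRAW, calling submission.comments.replace_more(limit=0) beforehand removes the MoreComments objects so the isinstance check becomes unnecessary.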
CommentForest provides the .list method, which can be used for getting all comments inside the comment section. This will first output all the top-level comments, followed by the second-level comments and so on, until there are no comments left.
PRAW is a Python wrapper for the Reddit API, which enables us to use the Reddit API with a clean Python interface. The API can be used for webscraping, creating a bot, as well as many other things. This article covered authentication, getting posts from a subreddit and getting comments. To learn more about the API, I suggest you take a look at their excellent documentation. If you liked this article, consider subscribing to my Youtube channel and following me on social media. The code covered in this article is available as a Github repository. If you have any questions, recommendations or critiques, I can be reached via Twitter or the comment section.
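The level-by-level ordering described above can be illustrated with a breadth-first traversal over stand-in comments; the Comment class and the flatten helper below are our own sketch of what CommentForest.list does, not PRAW code:

```python
from collections import deque

# Sketch of breadth-first flattening: top-level comments first, then
# second-level replies, and so on. Comment is a stand-in; real PRAW comments
# expose their children via .replies similarly.
class Comment:
    def __init__(self, body, replies=None):
        self.body = body
        self.replies = replies or []

def flatten(comments):
    queue = deque(comments)
    flat = []
    while queue:
        comment = queue.popleft()
        flat.append(comment.body)
        queue.extend(comment.replies)  # children are visited after all siblings
    return flat

tree = [Comment("top-1", [Comment("reply-1a")]), Comment("top-2")]
print(flatten(tree))  # ['top-1', 'top-2', 'reply-1a']
```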
How to Scrape Reddit using Python Scrapy | Proxies API

Scrapy is one of the most accessible tools that you can use to scrape and also spider a website with effortless ease. Today let’s see how we can scrape Reddit to get new posts from a subreddit like r/programming.
First, we need to install scrapy if you haven’t already:
pip install scrapy
Once installed, go ahead and create a project by invoking the startproject command:
scrapy startproject scrapingproject
This will output something like:
New Scrapy project ‘scrapingproject’, using template directory ‘/Library/Python/2.7/site-packages/scrapy/templates/project’, created in:
/Applications/MAMP/htdocs/scrapy_examples/scrapingproject
You can start your first spider with:
cd scrapingproject
scrapy genspider example example.com
This will create a folder structure. Now cd into the project. You will need to do it twice, like this:
cd scrapingproject
cd scrapingproject
Now we need a spider to crawl through the programming subreddit. So we use the genspider command to tell scrapy to create one for us. We call the spider ourfirstbot and pass it the url of the subreddit:
scrapy genspider ourfirstbot www.reddit.com/r/programming/
This should return successful like this:
Created spider ‘ourfirstbot’ using template ‘basic’ in module:
scrapingproject.spiders.ourfirstbot
Great. Now open the file ourfirstbot.py in the spiders folder. It should look like this:

# -*- coding: utf-8 -*-
import scrapy

class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    allowed_domains = ['www.reddit.com']
    start_urls = ['https://www.reddit.com/r/programming/']

    def parse(self, response):
        pass

Let’s examine this code before we proceed. The allowed_domains array restricts all further crawling to the domains specified there. start_urls is the list of urls to crawl; for us, in this example, we only need one. The parse(self, response) function is called by scrapy after every successful url crawl. Here is where we can write our code to extract the data we want.
We now need to find the css selectors of the elements we want to extract. Go to the url and right click on the title of one of the posts and click on inspect. This will open the Google Chrome Inspector. You can see that the css class name of the title is _eYtD2XCVieq6emjKBH3m, so we are going to ask scrapy to get us the text property of this class, like this:
titles = response.css('._eYtD2XCVieq6emjKBH3m::text').extract()
Similarly, we try and find the class names of the votes element and the number of comments element (note that the class names might change by the time you run this code):
votes = response.css('._1rZYMD_4xY3gRcSS3p8ODO::text').extract()
comments = response.css('.FHCV02u6Cp2zYL0fhQPsO::text').extract()
If you are unfamiliar with css selectors, you can refer to this page by Scrapy. We have to now use the zip function to map the matching indices of multiple containers so that they can be used as a single entity. So here is how it looks:

# -*- coding: utf-8 -*-
import scrapy

class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = [
        'https://www.reddit.com/r/programming/',
    ]

    def parse(self, response):
        titles = response.css('._eYtD2XCVieq6emjKBH3m::text').extract()
        votes = response.css('._1rZYMD_4xY3gRcSS3p8ODO::text').extract()
        comments = response.css('.FHCV02u6Cp2zYL0fhQPsO::text').extract()
        # Give the extracted content row wise
        for item in zip(titles, votes, comments):
            # create a dictionary to store the scraped info
            all_items = {
                'title': item[0],
                'vote': item[1],
                'comments': item[2],
            }
            # yield or give the scraped info to scrapy
            yield all_items

And now let’s run this with the command:
scrapy crawl ourfirstbot
And Bingo, you get the results. Now let’s export the extracted data to a csv file. All you have to do is to provide the export file like this:
scrapy crawl ourfirstbot -o data.csv
Or, if you want the data in the JSON format:
scrapy crawl ourfirstbot -o data.json
Scaling Scrapy
The example above is ok for small scale web crawling projects. But if you try to scrape large quantities of data at high speeds from websites like Reddit, you will find that sooner or later your access will be restricted. Reddit can tell you are a bot, so one of the things you can do is to run the crawler impersonating a web browser. This is done by passing the user agent string to the Reddit webserver so it doesn’t block you, like this:
scrapy crawl ourfirstbot -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False
In more advanced implementations you will need to even rotate this string, so Reddit can’t tell it’s the same browser! Welcome to web scraping.
As we get a little bit more advanced, you will realise that Reddit can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail. Investing in a private rotating proxy service like Proxies API can most of the time make the difference between a successful, headache-free web scraping project which gets the job done consistently and one that never really works. Plus, with the 1000 free API calls running offer, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration.
Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems: millions of high speed rotating proxies located all over the world, with automatic IP rotation, automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions) and automatic CAPTCHA solving technology. Hundreds of our customers have successfully solved the headache of IP blocks with a simple API. The whole thing can be accessed by a simple API like below in any programming language. We have a running offer of 1000 API calls completely free. Register and get your free API Key.
Once you have an API_KEY from Proxies API, you just have to change one line in your spider: the start_urls array, so that requests are routed through the Proxies API endpoint. With that single change, we will never have to worry about IP rotation, user agent string rotation or even rate limits ever again.
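The zip step in the spider is plain Python and can be tried outside Scrapy. The lists below are invented stand-ins for what response.css('...::text').extract() would return from a live page:

```python
# Stand-in lists simulating the output of the three .extract() calls in the
# spider; the titles, vote counts and comment strings are made up.
titles = ["Why Rust?", "Show HN clone in 200 lines"]
votes = ["1.2k", "340"]
comments = ["312 comments", "57 comments"]

rows = []
# zip pairs up the i-th title with the i-th vote and i-th comment count,
# exactly as the parse() method does before yielding each item to Scrapy
for item in zip(titles, votes, comments):
    rows.append({
        "title": item[0],
        "vote": item[1],
        "comments": item[2],
    })
print(rows[0])  # {'title': 'Why Rust?', 'vote': '1.2k', 'comments': '312 comments'}
```

Note that zip stops at the shortest list, so if one selector misses an element on the page, rows from the other lists silently shift or drop; worth keeping in mind when the class names change.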
