• November 9, 2024

How To Use Scrapy

Scrapy Tutorial — Scrapy 2.5.1 documentation

In this tutorial, we’ll assume that Scrapy is already installed on your system.
If that’s not the case, see Installation guide.
We are going to scrape quotes.toscrape.com, a website
that lists quotes from famous authors.
This tutorial will walk you through these tasks:
Creating a new Scrapy project
Writing a spider to crawl a site and extract data
Exporting the scraped data using the command line
Changing spider to recursively follow links
Using spider arguments
Scrapy is written in Python. If you’re new to the language you might want to
start by getting an idea of what the language is like, to get the most out of
Scrapy.
If you’re already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good resource.
If you’re new to programming and want to start with Python, the following books
may be useful to you:
Automate the Boring Stuff With Python
How To Think Like a Computer Scientist
Learn Python 3 The Hard Way
You can also take a look at this list of Python resources for non-programmers,
as well as the suggested resources in the learnpython-subreddit.
Creating a project¶
Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you’d like to store your code and run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Our first Spider¶
Spiders are classes that you define and that Scrapy uses to scrape information
from a website (or a group of websites). They must subclass
Spider and define the initial requests to make,
optionally how to follow links in the pages, and how to parse the downloaded
page content to extract data.
This is the code for our first Spider. Save it in a file named
quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider
and defines some attributes and methods:
name: identifies the Spider. It must be
unique within a project, that is, you can’t set the same name for different
Spiders.
start_requests(): must return an iterable of
Requests (you can return a list of requests or write a generator function)
which the Spider will begin to crawl from. Subsequent requests will be
generated successively from these initial requests.
parse(): a method that will be called to handle
the response downloaded for each of the requests made. The response parameter
is an instance of TextResponse that holds
the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting
the scraped data as dicts and also finding new URLs to
follow and creating new requests (Request) from them, as the sketch just below illustrates.
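To make that concrete, here is a minimal sketch (not part of the tutorial itself; the div.item, span.title::text and a.next::attr(href) selectors are hypothetical) of a parse() method doing both jobs at once, yielding dicts with extracted data and yielding new Requests for links to follow:
def parse(self, response):
    # Each yielded dict becomes one scraped item.
    for row in response.css('div.item'):  # hypothetical selector
        yield {'title': row.css('span.title::text').get()}
    # Each yielded Request is scheduled and crawled later.
    next_href = response.css('a.next::attr(href)').get()  # hypothetical selector
    if next_href is not None:
        yield scrapy.Request(response.urljoin(next_href), callback=self.parse)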
How to run our spider¶
To put our spider to work, go to the project's top-level directory and run:
scrapy crawl quotes
This command runs the spider with name quotes that we've just added, and it
will send some requests to the quotes.toscrape.com domain. You will get an output
similar to this:… (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)…
Now, check the files in the current directory. You should notice that two new
files have been created: quotes-1.html and quotes-2.html, with the content
for the respective URLs, as our parse method instructs.
Note
If you are wondering why we haven’t parsed the HTML yet, hold
on, we will cover that soon.
What just happened under the hood?¶
Scrapy schedules the scrapy.Request objects
returned by the start_requests method of the Spider. Upon receiving a
response for each one, it instantiates Response objects
and calls the callback method associated with the request (in this case, the
parse method) passing the response as argument.
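To make the request/callback pairing explicit, here is a small sketch (not from the tutorial; the parse_detail name and the second URL are hypothetical). Each Request names its own callback, and parse() is simply the default used when none is given:
def start_requests(self):
    # This response will be handled by parse().
    yield scrapy.Request('http://quotes.toscrape.com/page/1/', callback=self.parse)
    # This one will be handled by a different method entirely.
    yield scrapy.Request('http://quotes.toscrape.com/random', callback=self.parse_detail)  # hypothetical URL

def parse_detail(self, response):
    self.log(f'Got {response.url} with status {response.status}')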
A shortcut to the start_requests method¶
Instead of implementing a start_requests() method
that generates scrapy.Request objects from URLs,
you can just define a start_urls class attribute
with a list of URLs. This list will then be used by the default implementation
of start_requests() to create the initial requests
for your spider:
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
The parse() method will be called to handle each
of the requests for those URLs, even though we haven’t explicitly told Scrapy
to do so. This happens because parse() is Scrapy’s
default callback method, which is called for requests without an explicitly
assigned callback.
Storing the scraped data¶
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -O quotes.json
That will generate a quotes.json file containing all scraped items,
serialized in JSON.
The -O command-line switch overwrites any existing file; use -o instead
to append new content to any existing file. However, appending to a JSON file
makes the file contents invalid JSON. When appending to a file, consider
using a different serialization format, such as JSON Lines:
scrapy crawl quotes -o quotes.jsonl
The JSON Lines format is useful because it's stream-like: you can easily
append new records to it, and it doesn't have the same problem as JSON when you run
the command twice. Also, as each record is a separate line, you can process big files
without having to fit everything in memory; there are tools like JQ to help
do that at the command line.
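As a quick illustration of that stream-like property, here is a small Python sketch (assuming the export went to quotes.jsonl, as in the command above) that processes the file one record at a time instead of loading it all into memory:
import json

with open('quotes.jsonl', encoding='utf-8') as f:
    for line in f:                     # one JSON object per line
        record = json.loads(line)
        print(record.get('author'), record.get('text', '')[:40])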
In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an Item Pipeline. A placeholder file
for Item Pipelines has been set up for you when the project is created, in
tutorial/pipelines.py. However, you don't need to implement any item
pipelines if you just want to store the scraped items.
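For a rough idea of what would go into that file, here is a minimal hedged sketch of an item pipeline (the class name and the drop rule are made up for illustration); to actually use it, you would also list it in the ITEM_PIPELINES setting:
# tutorial/pipelines.py (sketch)
from scrapy.exceptions import DropItem


class RequireTextPipeline:
    # Drop any item that has no 'text' field; pass everything else through.
    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem('missing text')
        return item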
Following links¶
Let's say, instead of just scraping the stuff from the first two pages
of quotes.toscrape.com, you want quotes from all the pages on the website.
Now that you know how to extract data from pages, let’s see how to follow links
from them.
First thing is to extract the link to the page we want to follow. Examining
our page, we can see there is a link to the next page with the following
markup:
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>
We can try extracting it in the shell:
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the attribute href. For that,
Scrapy supports a CSS extension that lets you select the attribute contents,
like this:
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
There is also an attrib property available
(see Selecting element attributes for more):
>>> response.css('li.next a').attrib['href']
'/page/2/'
Let’s see now our spider modified to recursively follow the link to the next
page, extracting data from it:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
Now, after extracting the data, the parse() method looks for the link to
the next page, builds a full absolute URL using the
urljoin() method (since the links can be
relative) and yields a new request to the next page, registering itself as
callback to handle the data extraction for the next page and to keep the
crawling going through all the pages.
What you see here is Scrapy’s mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page it’s
visiting.
In our example, it creates a sort of loop, following all the links to the next page
until it doesn’t find one – handy for crawling blogs, forums and other sites with
pagination.
A shortcut for creating Requests¶
As a shortcut for creating Request objects you can use response.follow. The quote
fields are extracted as before (for example 'author': quote.css('span small::text').get()),
and the next-page request becomes:
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly – no
need to call urljoin. Note that response.follow just returns a Request
instance; you still have to yield this Request.
You can also pass a selector to response.follow instead of a string;
this selector should extract the necessary attributes:
for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href
attribute automatically. So the code can be shortened further:
for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)
To create multiple requests from an iterable, you can use
response.follow_all instead:
anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)
or, shortening it further:
yield from response.follow_all(css='ul.pager a', callback=self.parse)
More examples and patterns¶
Here is another spider that illustrates callbacks and following links,
this time for scraping author information:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider will start from the main page; it will follow all the links to the
author pages, calling the parse_author callback for each of them, and also
the pagination links with the parse callback as we saw before.
Here we're passing callbacks to response.follow_all as positional
arguments to make the code shorter; it also works for Request.
The parse_author callback defines a helper function to extract and clean up the
data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are
many quotes from the same author, we don’t need to worry about visiting the
same author page multiple times. By default, Scrapy filters out duplicated
requests to URLs already visited, avoiding the problem of hitting servers too
much because of a programming mistake. This can be configured by the setting
DUPEFILTER_CLASS.
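If you ever want a particular request to bypass that duplicate filter (for example, to deliberately re-fetch a page), Request accepts a dont_filter flag; a one-line sketch:
# Re-request a URL even if it was already visited (use sparingly).
yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)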
Hopefully by now you have a good understanding of how to use the mechanism
of following links and callbacks with Scrapy.
As yet another example spider that leverages the mechanism of following links,
check out the CrawlSpider class for a generic
spider that implements a small rules engine that you can use to write your
crawlers on top of it.
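As a rough sketch of what that looks like (the allow pattern and the parse_page name below are our own illustration, not code from the tutorial), a CrawlSpider pairs link-extraction rules with callbacks:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes-crawl'
    start_urls = ['http://quotes.toscrape.com/']

    # Follow pagination links automatically; call parse_page for each page crawled.
    rules = (
        Rule(LinkExtractor(allow=r'/page/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}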
Also, a common pattern is to build an item with data from more than one page,
using a trick to pass additional data to the callbacks.
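One way to do that, sketched here on the assumption that you want to combine a quote from the listing page with details from its author page, is the cb_kwargs argument of Request and response.follow, which passes extra keyword arguments to the callback:
def parse(self, response):
    for quote in response.css('div.quote'):
        author_href = quote.css('.author + a::attr(href)').get()
        if author_href is not None:
            yield response.follow(
                author_href,
                callback=self.parse_author,
                cb_kwargs={'quote_text': quote.css('span.text::text').get()},
            )

def parse_author(self, response, quote_text):
    # cb_kwargs arrive as extra keyword arguments on the callback.
    yield {
        'quote_text': quote_text,
        'bio': response.css('.author-description::text').get(default='').strip(),
    }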
Using spider arguments¶
You can provide command line arguments to your spiders by using the -a
option when running them:
scrapy crawl quotes -O quotes-humor.json -a tag=humor
These arguments are passed to the Spider’s __init__ method and become
spider attributes by default.
In this example, the value provided for the tag argument will be available
via self.tag. You can use this to make your spider fetch only quotes
with a specific tag, building the URL based on the argument:
def start_requests(self):
    url = 'http://quotes.toscrape.com/'
    tag = getattr(self, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    yield scrapy.Request(url, self.parse)

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
        }
If you pass the tag=humor argument to this spider, you'll notice that it
will only visit URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
You can learn more about handling spider arguments here.
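Since -a values always arrive as strings, a common follow-up (a sketch of ours, with a hypothetical pages argument) is to convert them explicitly, for example in __init__:
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def __init__(self, tag=None, pages='1', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tag = tag
        self.pages = int(pages)  # -a pages=5 arrives as the string '5'
You would then run it as, for example, scrapy crawl quotes -a tag=humor -a pages=5.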
Next steps¶
This tutorial covered only the basics of Scrapy, but there’s a lot of other
features not mentioned here. Check the What else? section in
Scrapy at a glance chapter for a quick overview of the most important ones.
You can continue from the section Basic concepts to know more about the
command-line tool, spiders, selectors and other things the tutorial hasn’t covered like
modeling the scraped data. If you prefer to play with an example project, check
the Examples section.
Implementing Web Scraping in Python with Scrapy - GeeksforGeeks

Nowadays data is everything, and if someone wants to get data from web pages, one way is to use an API or to implement web scraping techniques. In Python, web scraping can be done easily by using scraping tools like BeautifulSoup. But what if the user is concerned about the performance of the scraper, or needs to scrape data efficiently? To overcome this problem, one can make use of multithreading/multiprocessing with the BeautifulSoup module and build a spider that crawls over a website and extracts data, or one can save that effort and use Scrapy instead. With the help of Scrapy one can:
1. Fetch millions of data efficiently
2. Run it on server
3. Fetching data
4. Run the spider in multiple processes
Scrapy comes with a whole new set of features for creating a spider, running it and then saving the data easily by scraping it. At first it looks quite confusing, but it's for the best. Let's talk about the installation, creating a spider and then testing it.
Step 1: Creating a virtual environment
It is good to create one virtual environment as it isolates the program and doesn't affect any other programs present on the machine. To create a virtual environment, first install it by using:
sudo apt-get install python3-venv
Create one folder and then activate it:
mkdir scrapy-project && cd scrapy-project
python3 -m venv myvenv
If the above command gives an error, then try this:
python3.5 -m venv myvenv
After creating the virtual environment, activate it by using:
source myvenv/bin/activate
Step 2: Installing the Scrapy module
Install Scrapy by using:
pip install scrapy
To install Scrapy for any specific version of Python:
python3.5 -m pip install scrapy
Replace the 3.5 version with some other version like 3.6.
Step 3: Creating a Scrapy project
While working with Scrapy, one needs to create a Scrapy project:
scrapy startproject gfg
In Scrapy, always try to create one spider which helps to fetch data, so to create one, move to the spiders folder and create one Python file over there. Create one spider with the gfgfetch.py Python file.
Step 4: Creating the spider
Move to the spiders folder and create gfgfetch.py. While creating a spider, always create one class with a unique name and define the requirements. The first thing is to name the spider by assigning it the name variable, and then provide the starting URL through which the spider will start crawling. Define some methods which help to crawl much deeper into that website. For now, let's scrape all the URLs present and store them:
import scrapy


class ExtractUrls(scrapy.Spider):
    name = "extract"

    def start_requests(self):
        urls = ['https://www.geeksforgeeks.org/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
The main motive is to get each URL and then request it, and fetch all the URLs or anchor tags from it. To do this, we need to create one more method, parse, to fetch data from the given URL.
Step 5: Fetching data from the given page
Before writing the parse function, test a few things, like how to fetch any data from the given page. To do this, make use of the Scrapy shell. It is just like the Python interpreter, but with the ability to scrape data from the given URL. In short, it's a Python interpreter with Scrapy support:
scrapy shell URL
Note: Make sure to run it in the same directory where scrapy.cfg is present, else it will not work. In the shell, for fetching data from the given page, use selectors. These selectors can be either CSS or XPath. For now, let's try to fetch all URLs by using CSS. To get the anchor tags:
response.css('a')
To extract the data:
links = response.css('a').extract()
For example, links[0] will show something like an anchor element whose text is 'GeeksforGeeks'. To get the href attribute, use:
attributes = response.css('a::attr(href)').extract()
This will get all the href data, which is very useful. Make use of these links and start requesting them: let's create the parse method, fetch all the URLs and then yield them. Follow each particular URL and fetch more links from that page, and this will keep on happening again and again. In short, we are fetching all the URLs present on the page. Scrapy, by default, filters those URLs which have already been visited, so it will not crawl the same URL path again. But it's possible that on two different pages there are two or more similar links. For example, on each page the header link will be available, which means that this header link will come with each page request. So try to exclude it by checking it:
def parse(self, response):
    title = response.css('title::text').extract_first()
    links = response.css('a::attr(href)').extract()
    for link in links:
        yield {
            'title': title,
            'links': link
        }
        if 'geeksforgeeks' in link:
            yield scrapy.Request(url=link, callback=self.parse)
Below is the implementation of the scraper:
import scrapy


class ExtractUrls(scrapy.Spider):
    name = "extract"

    def start_requests(self):
        urls = ['https://www.geeksforgeeks.org/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.css('title::text').extract_first()
        links = response.css('a::attr(href)').extract()
        for link in links:
            yield {
                'title': title,
                'links': link
            }
            if 'geeksforgeeks' in link:
                yield scrapy.Request(url=link, callback=self.parse)
Step 6: In the last step, run the spider and get the output in a simple JSON file:
scrapy crawl NAME_OF_SPIDER -o links.json
Here, the name of the spider is "extract" for the given example. It will fetch loads of data within a few seconds.
Note: Scraping any web page is not always a legal activity. Don't perform any scraping operation without permission.
Getting Started With Scrapy - DZone Big Data

Scrapy is a Python-based web crawler that can be used to extract information from websites. It is fast and simple, and can navigate pages just like a browser can.
However, note that it is not suitable for websites and apps that use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that might be used by the website to tailor the user’s experience.
Installation
We use Virtualenv to install scrapy. This allows us to install scrapy without affecting other system-installed modules.
Create a working directory and initialize a virtual environment in that directory.
mkdir working
cd working
virtualenv venv
. venv/bin/activate
Install scrapy now.
pip install scrapy
Check that it is working. The following display shows the version of Scrapy as 1.4.0.
scrapy
# prints
Scrapy 1.4.0 – no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)…
Writing a Spider
Scrapy works by loading a Python module called a spider, which is a class inheriting from scrapy.Spider.
Let's write a simple spider class to load the top posts from Reddit.
To begin with, create a Python file for the spider and add the following to it. This is a complete spider class, though one which does not do anything useful for us. A spider class requires, at a minimum, the following:
A name identifying the spider.
A start_urls list variable containing the URLs from which to begin crawling.
A parse() method, which can be a no-op as shown.
import scrapy


class redditspider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        pass
This class can now be executed by passing the spider's file name to scrapy runspider:
scrapy runspider <spider-file>.py
# prints…
2017-06-16 10:42:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 10:42:34 [scrapy.core.engine] INFO: Spider opened
2017-06-16 10:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)…
Turn Off Logging
As you can see, this spider runs and prints a bunch of messages, which can be useful for debugging. However, since it obscures the output of our program, let's turn it off for now.
Add these lines to the beginning of the file:
import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)
Now, when we run the spider, we should not see the obfuscating messages.
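An alternative sketch that stays within Scrapy's own configuration (LOG_LEVEL is a standard Scrapy setting; the class shown is the article's spider) is to raise the log level through custom_settings instead of touching the logging module directly:
class redditspider(scrapy.Spider):
    name = 'reddit'
    # Only warnings and errors from Scrapy's own logging will be shown.
    custom_settings = {'LOG_LEVEL': 'WARNING'}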
Parsing the Response
Let's now parse the response from the scraper. This is done in the method parse(). In this method, we use the method response.css() to perform CSS-style selections on the HTML and extract the required elements.
To identify the CSS selections to extract, we use Chrome's DOM Inspector tool to pick the elements. From Reddit's front page, we see that each post is wrapped in a div element with the class thing.
So we select every div.thing from the page and use it to work with further:
for element in response.css('div.thing'):
We also implement the following helper methods within the spider class to extract the required text.
The following method extracts all text from an element as a list, joins the elements with a space, and strips away the leading and trailing whitespace from the result.
def a(self, response, cssSel):
    return ' '.join(response.css(cssSel).extract()).strip()
And this method extracts text from the first element and returns it.
def f(self, response, cssSel):
    return response.css(cssSel).extract_first()
Extracting Required Elements
Once these helper methods are in place, let's extract the title from each Reddit post. Within div.thing, the title is available at p.title > a. As mentioned before, the CSS selection for the required elements can be determined from any browser's DOM Inspector.
def parse(self, resp):
    for e in resp.css('div.thing'):
        yield {
            'title': self.a(e, 'p.title>a::text'),
        }
The results are returned to the caller using python’s yield statement. The way yield works is as follows — executing a function which contains a yield statement returns a generator to the caller. The caller repeatedly executes this generator and receives results of the execution till the generator terminates.
In our case, the parse() method returns a dictionary object containing a key (title) to the caller on each invocation till the list ends.
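To see this outside of Scrapy, here is a tiny plain-Python illustration (unrelated to the Reddit spider) of a generator handing back one value per iteration until it is exhausted:
def numbers():
    yield 1
    yield 2
    yield 3

for n in numbers():  # the caller drives the generator one value at a time
    print(n)         # prints 1, then 2, then 3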
Running the Spider and Collecting Output
Let us now run the spider again. A part of the copious output is shown (after re-instating the log statements).
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/>
{‘title’: u’The Plight of a Politician’}
{‘title’: u’Elephants foot compared to humans foot’}…
It is hard to see the real output. Let us redirect the output to a JSON file with the -o option:
scrapy runspider <spider-file>.py -o <output-file>.json
And here is a part of…
{“title”: “They got fit together”},
{“title”: “Not all heroes wear capes”},
{“title”: “This sub”},
{“title”: “So I picked this up at a flea market.. “},…
Extract All Required Information
Let’s also extract the subreddit name and the number of votes for each post. To do that, we just update the result returned from the yield statement.
def parse(S, r):
    for e in r.css('div.thing'):
        yield {
            'title': S.a(e, 'p.title>a::text'),
            'votes': S.f(e, 'div.score.unvoted::attr(title)'),
            'subreddit': S.a(e, 'p.tagline>a.subreddit::text'),
        }
The resulting…
{“votes”: “28962”, “title”: “They got fit together”, “subreddit”: “r/pics”},
{“votes”: “6904”, “title”: “My puppy finally caught his Stub”, “subreddit”: “r/funny”},
{“votes”: “3925”, “title”: “Reddit, please find this woman who went missing during E3! “, “subreddit”: “r/NintendoSwitch”},
{“votes”: “30079”, “title”: “Yo-Yo Skills”, “subreddit”: “r/gifs”},
{“votes”: “2379”, “title”: “For every upvote I won’t smoke for a day”, “subreddit”: “r/stopsmoking”},…
Conclusion
This article provided a basic view of how to extract information from websites using Scrapy. To use scrapy, we need to write a spider module which instructs scrapy to crawl a website and extract structured information from it. This information can then be returned in JSON format for consumption by downstream software.
Topics:
big data,
tutorial,
scrapy,
web scraping,
python

Frequently Asked Questions about how to use scrapy

How do you use a Scrapy tool?

While working with Scrapy, one needs to create a Scrapy project. In Scrapy, always try to create one spider which helps to fetch data, so to create one, move to the spiders folder and create one Python file over there, named gfgfetch.py. (Nov 8, 2019)

How do I get started with Scrapy?

Getting Started With Scrapy: Installation. We use Virtualenv to install Scrapy. … Writing a Spider. … Turn Off Logging. … Parsing the Response. … Extracting Required Elements. … Running the Spider and Collecting Output. … Extract All Required Information. (Jun 20, 2017)

What is Scrapy used for?

Scrapy is a Python framework for large scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format. (Jul 25, 2017)
