How To Run Scrapy In Python
How to Run Scrapy From a Script – Towards Data Science
Forget about Scrapy's command-line framework and write it all in a python script. Scrapy is a great framework to use for scraping projects, but did you know there is a way to run Scrapy straight from a script? Looking at the documentation, there are two ways to run Scrapy: through the Scrapy API or through the command-line tool. In this article, you will learn:

Why you would use scrapy from a script
The basic script you need every time you want to access scrapy from an individual script
How to specify customised scrapy settings
How to specify the HTTP requests you want scrapy to invoke
How to process those HTTP responses with scrapy, all under one spider

Why Use Scrapy from a Script?

Scrapy can be used for heavy-duty scraping work; however, there are a lot of projects that are quite small and don't need the whole scrapy framework. This is where using scrapy in a python script comes in: you can do it all from a single python file. Let's see what the basics of this look like before fleshing out some of the settings needed to scrape.

The key to running scrapy in a python script is the CrawlerProcess class. This is a class of the scrapy.crawler module, and it provides the engine to run scrapy within a python script. Within the CrawlerProcess class code, python's twisted framework is imported. Twisted is a python framework used for input and output processes such as HTTP requests, and it does this through an event reactor. Scrapy is built on top of twisted! We won't go into too much detail here, but the CrawlerProcess class imports a twisted reactor which listens for events like multiple HTTP requests. This is at the heart of how scrapy works. CrawlerProcess assumes that the twisted reactor is NOT being used by anything else, for example by another spider. With that, we have the basic code:

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

For us to use the scrapy framework, we must create our spider. This is done by creating a class which inherits from scrapy.Spider, the most basic spider that we must derive from in all scrapy projects. We also have to give the spider a name for it to run. Spiders normally need a couple of methods and a URL to scrape, but for this example we will omit them for the moment.

You will also see if __name__ == "__main__". This is a best practice in python: when we write a script, we want it to run as a program but also to be importable from somewhere else.

We instantiate the CrawlerProcess class first to get access to the functions we want. CrawlerProcess has two functions we are interested in: crawl and start. We use crawl to schedule the spider we created. We then use the start function to start the twisted reactor, the engine that makes and listens to the HTTP requests we want.

The scrapy framework provides a list of settings that it will use automatically; however, when working with the Scrapy API we have to provide the settings explicitly. The settings we define are how we can customise our spiders. The spider class has a variable called custom_settings, which can be used to override the settings scrapy would otherwise use automatically. We have to create a dictionary of our settings to do this, as the custom_settings variable is set to None by default. You may want to use some or most of the settings scrapy provides, in which case you could copy them from there.
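As a side note, settings do not have to live on the spider class at all: CrawlerProcess also accepts a settings dictionary when it is instantiated, and those settings then apply to every spider run by that process. Here is a minimal sketch reusing the TestSpider above; the setting values are only examples, not ones the article prescribes:

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'

if __name__ == "__main__":
    # Process-wide settings; a spider's own custom_settings would still
    # take priority over these values.
    process = CrawlerProcess(settings={
        'DOWNLOAD_DELAY': 1,
        'LOG_LEVEL': 'INFO',
    })
    process.crawl(TestSpider)
    process.start()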
Alternatively, a list of the built-in settings can be found in the scrapy documentation.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}

We have shown how to create a spider and define the settings, but we haven't specified any URLs to scrape, or how we want to shape the requests to the website we want data from; for example, parameters, headers and user-agents.

When we create a spider, scrapy also calls a method named start_requests(). This creates the requests for any URL we want. There are two ways to use this method: 1) by defining the start_urls attribute, or 2) by implementing our own start_requests function. The shortest way is by defining start_urls. We define it as a list of the URLs we want to get; by specifying this variable we automatically use the default start_requests() to go through each one of our URLs.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    start_urls = ['URL1', 'URL2']

However, notice how if we do this, we can't specify our headers or anything else we want to go along with the request? This is where implementing our own start_requests method comes in. First we define the variables we want to go along with the request, then we implement our own start_requests method so we can make use of the headers we want, as well as decide where we want the response to go. Note that scrapy.Request, unlike the requests library, has no params keyword, so any query parameters have to be encoded into the URL itself.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    headers = {}
    url = 'URL'

    def start_requests(self):
        yield scrapy.Request(self.url, headers=self.headers)

Here we use scrapy.Request which, when given a URL, will make the HTTP request and return a response, bound to the response variable.

You will notice how we didn't specify a callback? That is, we didn't specify where scrapy should send the responses to the requests we just told it to make. Let's fix that. By default scrapy expects the callback method to be the parse function, but it could be any method we want.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    headers = {}
    url = 'URL'

    def start_requests(self):
        yield scrapy.Request(self.url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        print(response.body)

Here we have defined the function parse, which accepts a response variable; remember, this is created when we ask scrapy to make the HTTP requests. We then ask scrapy to print the response body.

With that, we now have the basics of running scrapy in a python script. We can use all the same methods, we just have to do a bit of configuring first.

Questions to review: Why might you use the scrapy framework? When is importing scrapy in a python script useful? What does the CrawlerProcess class do? Can you recall the basic script used to start scrapy within a python script? How do you add scrapy settings in your python script? Why might you use a start_requests function instead of start_urls?
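To recap, here is a minimal end-to-end sketch that puts the whole pattern together and actually runs. The target site (quotes.toscrape.com), the header value and the CSS selector are illustrative assumptions, not taken from the article above:

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}

    # Illustrative values only; swap in whatever your target needs.
    headers = {'User-Agent': 'my-test-bot'}
    url = 'http://quotes.toscrape.com/'

    def start_requests(self):
        # Query parameters, if any, must already be encoded into the URL.
        yield scrapy.Request(self.url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Print the page title as a quick sanity check.
        print(response.css('title::text').get())

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()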
Command line tool — Scrapy 2.5.1 documentation
Scrapy is controlled through the scrapy command-line tool, to be referred
here as the “Scrapy tool” to differentiate it from the sub-commands, which we
just call “commands” or “Scrapy commands”.
The Scrapy tool provides several commands, for multiple purposes, and each one
accepts a different set of arguments and options.
(The scrapy deploy command has been removed in 1.0 in favor of the
standalone scrapyd-deploy. See Deploying your project.)
Configuration settings¶
Scrapy will look for configuration parameters in ini-style scrapy.cfg files
in standard locations:
/etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide),
~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME)
for global (user-wide) settings, and
scrapy.cfg inside a Scrapy project's root (see next section).
Settings from these files are merged in the listed order of preference:
user-defined values have higher priority than system-wide defaults
and project-wide settings will override all others, when defined.
Scrapy also understands, and can be configured through, a number of environment
variables. Currently these are:
SCRAPY_SETTINGS_MODULE (see Designating the settings)
SCRAPY_PROJECT (see Sharing the root directory between projects)
SCRAPY_PYTHON_SHELL (see Scrapy shell)
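For example, assuming a project whose settings module is myproject.settings (the module, project and spider names below are placeholders), these variables can be exported in the shell before invoking the tool:
$ export SCRAPY_SETTINGS_MODULE=myproject.settings
$ export SCRAPY_PROJECT=project2
$ scrapy crawl myspider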
Default structure of Scrapy projects¶
Before delving into the command-line tool and its sub-commands, let’s first
understand the directory structure of a Scrapy project.
Though it can be modified, all Scrapy projects have the same file
structure by default, similar to this:
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory where the scrapy.cfg file resides is known as the project
root directory. That file contains the name of the python module that defines
the project settings. Here is an example:
[settings]
default = myproject.settings
Sharing the root directory between projects¶
A project root directory, the one that contains the scrapy.cfg, may be
shared by multiple Scrapy projects, each with its own settings module.
In that case, you must define one or more aliases for those settings modules
under [settings] in your scrapy.cfg file:
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
By default, the scrapy command-line tool will use the default settings.
Use the SCRAPY_PROJECT environment variable to specify a different project
for scrapy to use:
$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot
Using the scrapy tool¶
You can start by running the Scrapy tool with no arguments and it will print
some usage help and the available commands:
Scrapy X.Y - no active project
Usage:
  scrapy <command> [options] [args]
Available commands:
crawl Run a spider
fetch Fetch a URL using the Scrapy downloader
[… ]
The first line will print the currently active project if you’re inside a
Scrapy project. In this example it was run from outside a project. If run from inside
a project it would have printed something like this:
Scrapy X.Y - project: myproject
Creating projects¶
The first thing you typically do with the scrapy tool is create your Scrapy
project:
scrapy startproject myproject [project_dir]
That will create a Scrapy project under the project_dir directory.
If project_dir wasn’t specified, project_dir will be the same as myproject.
Next, you go inside the new project directory:
cd project_dir
And you’re ready to use the scrapy command to manage and control your
project from there.
Controlling projects¶
You use the scrapy tool from inside your projects to control and manage
them.
For example, to create a new spider:
scrapy genspider mydomain mydomain.com
Some Scrapy commands (like crawl) must be run from inside a Scrapy
project. See the commands reference below for more
information on which commands must be run from inside projects, and which not.
Also keep in mind that some commands may have slightly different behaviours
when running them from inside projects. For example, the fetch command will use
spider-overridden behaviours (such as the user_agent attribute to override
the user-agent) if the url being fetched is associated with some specific
spider. This is intentional, as the fetch command is meant to be used to
check how spiders are downloading pages.
Available tool commands¶
This section contains a list of the available built-in commands with a
description and some usage examples. Remember, you can always get more info
about each command by running:
scrapy <command> -h
And you can see all available commands with:
scrapy -h
There are two kinds of commands, those that only work from inside a Scrapy
project (Project-specific commands) and those that also work without an active
Scrapy project (Global commands), though they may behave slightly different
when running from inside a project (as they would use the project overridden
settings).
Global commands:
startproject
genspider
settings
runspider
shell
fetch
view
version
Project-only commands:
crawl
check
list
edit
parse
bench
startproject¶
Syntax: scrapy startproject <project_name> [project_dir]
Requires project: no
Creates a new Scrapy project named project_name, under the project_dir
directory.
If project_dir wasn’t specified, project_dir will be the same as project_name.
Usage example:
$ scrapy startproject myproject
genspider¶
Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: no
Create a new spider in the current folder or in the current project's spiders folder, if called from inside a project. The <name> parameter is set as the spider's name, while <domain> is used to generate the allowed_domains and start_urls spider attributes.
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
$ scrapy genspider example example.com
Created spider ‘example’ using template ‘basic’
$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider ‘scrapyorg’ using template ‘crawl’
This is just a convenience shortcut command for creating spiders based on
pre-defined templates, but certainly not the only way to create spiders. You
can just create the spider source code files yourself, instead of using this
command.
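For reference, a spider file you write by hand only needs a few lines. The following is a minimal sketch roughly equivalent to what the basic template produces; the spider name and domain are placeholders:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Extract data from the downloaded page here.
        pass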
crawl¶
Syntax: scrapy crawl <spider>
Requires project: yes
Start crawling using a spider.
Usage examples:
$ scrapy crawl myspider
[… myspider starts crawling… ]
check¶
Syntax: scrapy check [-l] <spider>
Run contract checks.
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> ‘RetailPricex’ field is missing
[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
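The checks come from contracts written in the docstrings of spider callbacks. As a rough sketch of what a callback with contracts might look like (the URL, selectors and field names are illustrative, not taken from the documentation):

import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first_spider'

    def parse_item(self, response):
        """Parse a product page.

        @url http://www.example.com/some/product
        @returns items 1 1
        @scrapes name price
        """
        # The contract lines above tell scrapy check to fetch @url,
        # expect exactly one item back, and require the listed fields.
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }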
list¶
Syntax: scrapy list
List all available spiders in the current project. The output is one spider per
line.
$ scrapy list
spider1
spider2
edit¶
Syntax: scrapy edit <spider>
Edit the given spider using the editor defined in the EDITOR environment
variable or (if unset) the EDITOR setting.
This command is provided only as a convenience shortcut for the most common
case, the developer is of course free to choose any tool or IDE to write and
debug spiders.
fetch¶
Syntax: scrapy fetch <url>
Downloads the given URL using the Scrapy downloader and writes the contents to
standard output.
The interesting thing about this command is that it fetches the page how the
spider would download it. For example, if the spider has a USER_AGENT
attribute which overrides the User Agent, it will use that one.
So this command can be used to “see” how your spider would fetch a certain page.
If used outside a project, no particular per-spider behaviour would be applied
and it will just use the default Scrapy downloader settings.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--headers: print the response's HTTP headers instead of the response's body
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263'],
 'Connection': ['close'],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}
view¶
Syntax: scrapy view <url>
Opens the given URL in a browser, as your Scrapy spider would “see” it.
Sometimes spiders see pages differently from regular users, so this can be used
to check what the spider “sees” and confirm it’s what you expect.
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
shell¶
Syntax: scrapy shell [url]
Starts the Scrapy shell for the given URL (if given) or empty if no URL is
given. Also supports UNIX-style local file paths, either relative with ./ or ../ prefixes or absolute file paths.
See Scrapy shell for more info.
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them);
this only affects the URL you may pass as argument on the command line;
once you are inside the shell, fetch(url) will still follow HTTP redirects by default.
$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
# shell follows HTTP redirects by default
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
parse¶
Syntax: scrapy parse <url> [options]
Fetches the given URL and parses it with the spider that handles it, using the
method passed with the –callback option, or parse if not given.
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the
response
--meta or -m: additional request meta that will be passed to the callback
request. This must be a valid json string. Example: --meta='{"foo": "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback.
This must be a valid json string. Example: --cbkwargs='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider
rules to discover the callback (i.e. spider method) to use for parsing the
response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed
recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file
New in version 2.3.
$ scrapy parse http://www.example.com/ -c parse_item
[… scrapy log lines crawling spider… ]
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': 'Example item',
'category': 'Furniture',
'length': '12 cm'}]
# Requests -----------------------------------------------------------------
[]
settings¶
Syntax: scrapy settings [options]
Get the value of a Scrapy setting.
If used inside a project it’ll show the project setting value, otherwise it’ll
show the default Scrapy value for that setting.
Example usage:
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
runspider¶
Syntax: scrapy runspider <spider_file.py>
Run a spider self-contained in a Python file, without having to create a
project.
$ scrapy runspider myspider.py
[… spider starts crawling… ]
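A spider file suitable for runspider only has to contain the spider class itself. A minimal sketch, where the file name, target site and selectors are placeholders:

# myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

Running scrapy runspider myspider.py then crawls without any project around it.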
version¶
Syntax: scrapy version [-v]
Prints the Scrapy version. If used with -v it also prints Python, Twisted
and Platform info, which is useful for bug reports.
bench¶
Syntax: scrapy bench
Run a quick benchmark test. See Benchmarking for details.
Custom project commands¶
You can also add your custom project commands by using the
COMMANDS_MODULE setting. See the Scrapy commands in
scrapy/commands for examples on how to implement your commands.
COMMANDS_MODULE¶
Default: '' (empty string)
A module to use for looking up custom Scrapy commands. This is used to add custom
commands for your Scrapy project.
Example:
COMMANDS_MODULE = 'mybot.commands'
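As a sketch of what such a module might contain, a custom command is a class that subclasses ScrapyCommand and lives inside the package named by COMMANDS_MODULE; the command name is taken from the file name. The module path and behaviour below are illustrative assumptions, not from the documentation:

# mybot/commands/listnames.py  (hypothetical location)
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return "[options]"

    def short_desc(self):
        return "Print the names of all spiders in the project"

    def run(self, args, opts):
        # self.crawler_process is attached by Scrapy before run() is called.
        for name in sorted(self.crawler_process.spider_loader.list()):
            print(name)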
Register commands via entry points¶
You can also add Scrapy commands from an external library by adding a
scrapy.commands section in the entry points of the library setup.py
file.
The following example adds my_command command:
from setuptools import setup, find_packages
setup(name='scrapy-mymodule',
  entry_points={
    'scrapy.commands': [
      'my_command=my_scrapy_module.commands:MyCommand',
    ],
  },
)
Frequently Asked Questions about how to run scrapy in python
How do I run a scrapy python?
The key to running scrapy in a python script is the CrawlerProcess class. This is a class of the Crawler module. It provides the engine to run scrapy within a python script. Within the CrawlerProcess class code, python's twisted framework is imported.
How do I run scrapy in terminal?
You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands: Scrapy X.Y - no active project; Usage: scrapy <command> [options] [args]; Available commands: crawl (Run a spider), fetch (Fetch a URL using the Scrapy downloader), [...]
How do you run a scrapy spider?
In order to run a spider using the PyCharm terminal you can do the following: open the PyCharm project, open the terminal dialog (ALT + F12), navigate in the terminal to the spider file, and start the spider with a command; for running and getting output in the terminal window, that command is scrapy runspider CoinMarketCap.py.