Scrapy With Selenium
selenium with scrapy for dynamic page – Stack Overflow
If the URL doesn't change between the two pages, then you should add dont_filter=True to your Request(), or Scrapy will treat this URL as a duplicate after processing the first page.
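For illustration, a minimal sketch of such a follow-up request; the target site, form field, and callback name here are hypothetical:

import scrapy

class TwoStepSpider(scrapy.Spider):
    name = 'twostep'
    start_urls = ['http://example.com/search']

    def parse(self, response):
        # Submitting the form keeps the same URL, so disable the
        # duplicate filter for the second request to that URL.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'q': 'books'},   # hypothetical form field
            callback=self.parse_results,
            dont_filter=True,          # otherwise the scheduler drops it as a duplicate
        )

    def parse_results(self, response):
        pass  # extract data from the second page here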
If you need to render pages with JavaScript you should use scrapy-splash; you can also check this Scrapy middleware, which can handle JavaScript pages using Selenium, or you can do it by launching any headless browser.
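As a sketch of the Selenium-middleware route (assuming the scrapy-selenium package is installed and configured as described later in this article):

import scrapy
from scrapy_selenium import SeleniumRequest

class JsSpider(scrapy.Spider):
    name = 'js_spider'

    def start_requests(self):
        # The middleware renders the page in a real browser before
        # handing the response back to the callback.
        yield SeleniumRequest(url='http://example.com', callback=self.parse)

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML
        pass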
But a more effective and faster solution is to inspect your browser and see what requests are made when submitting a form or triggering a certain event. Try to simulate the same requests your browser sends. If you can replicate the request(s) correctly, you will get the data you need.
Here is an example:
import json

from scrapy import Spider, Request
from ..items import QuoteItem  # item class defined in the project's items.py

class ScrollScraper(Spider):
    name = 'scrollingscraper'
    quote_url = 'http://quotes.toscrape.com/api/quotes?page='  # JSON API paginated by page number
    start_urls = [quote_url + '1']

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get('quotes', []):
            quote_item = QuoteItem()
            quote_item['author'] = item.get('author', {}).get('name')
            quote_item['quote'] = item.get('text')
            quote_item['tags'] = item.get('tags')
            yield quote_item

        if data['has_next']:
            next_page = data['page'] + 1
            yield Request(self.quote_url + str(next_page))
When the pagination URL is the same for every page and uses a POST request, you can use FormRequest() instead of Request(); both are the same, but FormRequest adds a new argument (formdata=) to the constructor.
Here is another spider example from this post:
import json

import scrapy
from scrapy import FormRequest, Selector

class SpiderClass(scrapy.Spider):
    # spider name and all
    name = 'ajax'
    page_incr = 1
    start_urls = ['']        # start URL elided in the source
    pagination_url = ''      # POST endpoint elided in the source

    def parse(self, response):
        sel = Selector(response)

        if self.page_incr > 1:
            json_data = json.loads(response.text)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews',
            }
            yield FormRequest(self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return
Scraping Javascript Enabled Websites using Scrapy-Selenium
Scrapy-selenium is a middleware used in web scraping. Scrapy does not support scraping modern sites that use JavaScript frameworks, which is why this middleware is used with Scrapy: it provides the functionality of Selenium needed for working with JavaScript websites. Another advantage it provides is the driver, through which we can also see what is happening behind the scenes. Since Selenium is an automation tool, it also lets us deal with input tags and scrape according to what you pass into an input field. scrapy-selenium was introduced in 2018 and is open source. An alternative to it is scrapy-splash.

Install and set up Scrapy:

1. Install scrapy.
2. Run scrapy startproject projectname (projectname is the name of the project).
3. Run scrapy genspider spidername (replace spidername with your preferred spider name and the website that you want to scrape). Note: the URL can also be changed later, inside your Scrapy spider.

Integrating scrapy-selenium in a Scrapy project:

Install scrapy-selenium and add the following to your settings.py file. For Firefox:

from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

For Chrome:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

In both cases, enable the downloader middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

In this project the Chrome driver is used. The Chrome driver has to be downloaded according to the version of your Chrome browser: go to the help section in Chrome, click "About Google Chrome" to check your version, then download the matching chromedriver from the chromedriver website.

Where to add chromedriver: place the downloaded chromedriver executable inside the Scrapy project folder. With the addition to the settings.py file shown above, the only change to be made in the spider file is to issue SeleniumRequest instead of plain requests (shown below).

To run the project: scrapy crawl spidername (scrapy crawl integratedspider in this project).
Spider code before scrapy-selenium:

import scrapy

class IntegratedspiderSpider(scrapy.Spider):
    name = 'integratedspider'
    allowed_domains = ['']  # domain elided in the source

    def parse(self, response):
        pass

Important fields in scrapy-selenium:

name: name is a variable where the name of the spider is written, and each spider is recognized by this name. The command to run a spider is scrapy crawl spidername (here spidername refers to the name defined in the spider).

start_requests: the first requests to perform are obtained by calling the start_requests() method, which generates a Request for the URL specified in the url field of yield SeleniumRequest, with the parse method as the callback function for the request.

url: here the URL of the site is given.

screenshot: you can take a screenshot of a web page with the method get_screenshot_as_file(), with the filename as parameter; the screenshot is saved in the project folder.

callback: the function that will be called with the response of this request as its first parameter.

dont_filter: indicates that this request should not be filtered by the scheduler. If the same URL is sent to parse, it will not raise an exception for a URL that was already accessed; in other words, the same URL can be accessed more than once. The default value is False.

wait_time: Scrapy doesn't wait a fixed amount of time between requests, but with this field we can assign a wait time during rendering.

General structure of a scrapy-selenium spider:

import scrapy
from scrapy_selenium import SeleniumRequest

class IntegratedspiderSpider(scrapy.Spider):
    name = 'integratedspider'

    def start_requests(self):
        yield SeleniumRequest(
            url='',  # URL elided in the source
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        pass

Project of scraping with scrapy-selenium: scraping online course names from the GeeksforGeeks site using scrapy-selenium.

First get the XPath of the element we need to scrape, then use it in the spider. Code to scrape courses data from GeeksforGeeks:

import scrapy
from scrapy_selenium import SeleniumRequest

class IntegratedspiderSpider(scrapy.Spider):
    name = 'integratedspider'

    def start_requests(self):
        yield SeleniumRequest(
            url='',  # GeeksforGeeks courses URL elided in the source
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        courses = response.xpath('//*[@id="active-courses-content"]/div/div/div')
        for course in courses:
            course_name = course.xpath('.//a/div[2]/div/div[2]/h4/text()').get()
            course_name = course_name.split('\n')[1]
            course_name = course_name.strip()
            yield {
                'course Name': course_name
            }

Output: the scraped course names (shown as a screenshot in the original article).
Selenium vs Scrapy: Which One Should You Choose for Web …
Web scraping is a technique for extracting data from an online source. It provides you with structured data that can be stored in any format. This data can then be used in AI and ML algorithms. Web scraping can provide you with large volumes of clean data that are optimal for these algorithms.
There are various tools and libraries that can be used for web scraping. In this article we will focus on two of the most popular web scraping frameworks: Selenium and Scrapy. We will analyze both frameworks and then we will see which one is the best choice for your web scraping needs.
Selenium consists of a set of powerful tools and libraries that automate web browser actions. In layman's terms, this means that Selenium provides tools that can interact with browsers and automate browser actions like click, input, select, and navigate with the help of APIs and scripts. This capability can be used for testing web applications, including cross-browser testing. Selenium supports Safari, Firefox, Chrome, and Internet Explorer. Developed and released as an open source tool in 2004, Selenium is widely used by many companies worldwide.
Selenium for Web Scraping
You must be wondering: how can a test automation tool be used for web scraping? Selenium has a WebDriver component that provides web scraping features. There are various methods and objects in Selenium WebDriver that are useful for web scraping (a combined sketch follows this list). They are:
1. page_source
This attribute returns the HTML source of the page.
2. title
Gives the title of the page.
3. current_url
Used to get the current URL of the page.
4. find_elements
Gets a list of specific elements on a page. You can find an element by its name, class name, tag, id, or XPath.
5. WebElement
To get particular data from HTML elements, WebElement is used. text, click(), get_attribute(), and send_keys() are a few useful features of WebElement.
6. is_displayed()
A method used to find out whether an element is visible on a page. It returns True if the element is displayed and False otherwise.
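A minimal sketch tying these members together, assuming Selenium 4+, a chromedriver on the PATH, and http://quotes.toscrape.com as a stand-in target:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://quotes.toscrape.com/')

print(driver.title)           # title of the page
print(driver.current_url)     # current URL of the page
html = driver.page_source     # HTML source of the rendered page

# find_elements returns a list of matching WebElements
for quote in driver.find_elements(By.CLASS_NAME, 'quote'):
    text = quote.find_element(By.CLASS_NAME, 'text')
    if text.is_displayed():   # is the element visible on the page?
        print(text.text)      # text content of the WebElement

login = driver.find_element(By.XPATH, '//a[text()="Login"]')
print(login.get_attribute('href'))  # read an attribute

driver.get('http://quotes.toscrape.com/login')
driver.find_element(By.ID, 'username').send_keys('user')  # type into an input
driver.quit()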
Selenium Advantages
Free and open source
Provides multi-browser support
Supports Linux, Windows, and macOS
Multiple language support: Java, C#, Python, Kotlin, Ruby, JavaScript
Selenium Disadvantages
Selenium WebDriver occupies system resources even for small data sets
The scraping process begins only once the page is fully loaded, so processing is slow
For each browser you need to install a WebDriver component
Scrapy is a web scraping and web crawling framework designed to get structured data from websites. However, Scrapy can also be used for monitoring and automated testing of web applications. Scrapy was first released in 2008 and is written entirely in Python. Scrapy provides an asynchronous mechanism that processes multiple requests in parallel.
Scrapy for Web Scraping: Features
Here’s a list of the main built-in Scrapy features that make it a powerful web scraping tool:
1. Spiders
Spiders are classes that define a set of instructions to scrape a particular website. These customizable built-in classes provide an efficient approach to web scraping.
2. Selectors
Selectors in Scrapy are used to select parts of an HTML document defined by XPath or CSS expressions. With selectors you can also use regular expressions, through the re() method.
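A small sketch of selectors in action, runnable outside a spider; the sample HTML here is made up:

from scrapy.selector import Selector

html = '<div><p class="price">Price: $42.00</p></div>'  # made-up sample HTML
sel = Selector(text=html)

print(sel.css('p.price::text').get())   # 'Price: $42.00' via a CSS expression
print(sel.xpath('//p/text()').get())    # the same node via XPath
print(sel.css('p.price::text').re(r'\$\d+\.\d+'))  # ['$42.00'] via re()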
3. Items
Data extracted by spiders is returned as items. The itemadapter library supports the following item types: attrs objects, dictionaries, Item objects, and dataclass objects.
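As a sketch, an Item class like the QuoteItem used in the first spider above might be declared as follows (the field names are assumptions):

import scrapy

class QuoteItem(scrapy.Item):
    # each Field declares one attribute of the scraped item
    author = scrapy.Field()
    quote = scrapy.Field()
    tags = scrapy.Field()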
4. Item Pipeline
A Python class that validates, cleans, and stores the scraped data in a database. In addition to this, it also checks for duplicates.
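A sketch of a pipeline that drops duplicate items, assuming each item carries a 'quote' field; it would be enabled through ITEM_PIPELINES in settings.py:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['quote'] in self.seen:
            raise DropItem(f"Duplicate item found: {adapter['quote']!r}")
        self.seen.add(adapter['quote'])
        return item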
5. Requests and Responses
Requests are generated by the spider and sent to the endpoint, where each request is executed; the response object then carries the result of the issued request back to the spider.
6. Link Extractors
A powerful feature that extracts links from responses.
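A sketch of a CrawlSpider that uses a link extractor to follow links; the allow pattern matching the tag pages of quotes.toscrape.com is an assumption for illustration:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FollowSpider(CrawlSpider):
    name = 'follow'
    start_urls = ['http://quotes.toscrape.com/']

    # extract links whose URL matches /tag/ and hand them to parse_tag
    rules = (
        Rule(LinkExtractor(allow=r'/tag/'), callback='parse_tag'),
    )

    def parse_tag(self, response):
        yield {'url': response.url}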
Scrapy Built-in Services
Scrapy also provides the following built-in services to automate tasks when scraping:
Logging
Stats collection
Sending emails
Telnet console
Web service
Scrapy Advantages
Scrapy can export data in different formats such as CSV, XML, and JSON.
Scrapy provides an AutoThrottle extension that automatically adjusts the tool to the ideal crawling speed (a settings sketch for both follows this list).
Scrapy is asynchronous so it can load several pages in parallel.
Large volumes of data can be extracted
In terms of speed, Scrapy is fast
Scrapy consumes little memory and CPU space
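For example, both the export formats and AutoThrottle are configured in settings.py; the setting names below are documented Scrapy settings, while the values are illustrative:

# settings.py
FEEDS = {
    'quotes.json': {'format': 'json'},  # export scraped items as JSON
}

AUTOTHROTTLE_ENABLED = True             # adapt the delay to server load
AUTOTHROTTLE_START_DELAY = 5            # initial download delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server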
Scrapy Disadvantages
Scrapy cannot handle JavaScript by itself
The installation process varies for different operating systems
Scrapy requires Python 2.7+
When it comes to selecting only one library, Selenium or Scrapy, the decision ultimately boils down to the nature of the use cases. Each library has its own pros and cons. Selenium is primarily a web automation tool, however, Selenium WebDrivers can also be used to scrape data from websites, if you’re already using it or you’re scraping a JS website. On the other hand, Scrapy is a powerful web-scraping framework that can be used for scraping huge volumes of data from different websites.
Let’s see some examples about when to choose each:
Data Volumes
Let’s say we are working on a project where we need large volumes of data from different websites. To scrape those websites we have to make multiple calls using proxies and VPNs. In addition to this we need a robust mechanism and we can’t afford delays. In such scenarios, Scrapy is an ideal choice. Using Scrapy you can easily work with proxies and VPNs. It can pull large volumes of data since it is a specialized web scraping framework.
JavaScript Support
To scrape data from a website that uses JavaScript, Selenium is a better approach. However, you can use Scrapy to scrape JavaScript-based websites through the Splash library.
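A sketch of the Splash route, assuming the scrapy-splash package and a running Splash server configured per its README:

import scrapy
from scrapy_splash import SplashRequest

class JsQuotesSpider(scrapy.Spider):
    name = 'js_quotes'

    def start_requests(self):
        # Splash renders the JavaScript before returning the HTML
        yield SplashRequest('http://quotes.toscrape.com/js/',
                            callback=self.parse, args={'wait': 2})

    def parse(self, response):
        for quote in response.css('div.quote span.text::text').getall():
            yield {'quote': quote}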
Performance
Scrapy is asynchronous: it executes multiple requests simultaneously, and even if a request fails or an error happens, the other pending requests aren't affected. This improves the overall speed and efficiency of the process. Selenium is also robust, but with large data volumes the overall process is slow.
Tool       Data volume   JavaScript               Performance
Selenium   Medium-low    JS support               Robust, slow with high data volume
Scrapy     High          JS support via Splash    Fast
To conclude the above discussion I would say that both Selenium and Scrapy are powerful tools. The nature of work for which they’re originally developed is different from one another. Selenium is an excellent automation tool and Scrapy is by far the most robust web scraping framework. When we consider web scraping, in terms of speed and efficiency Scrapy is a better choice. While dealing with JavaScript based websites where we need to make AJAX/PJAX requests, Selenium can work better. I hope you got a clear understanding of Selenium vs. Scrapy and you are ready for your next project.
To learn more about using Selenium, check out this webinar.
Frequently Asked Questions about scrapy with selenium
Can I use Selenium with Scrapy?
Combining Selenium with Scrapy is a simple process. All that needs to be done is let Selenium render the webpage and, once it is done, pass the webpage's source to create a Scrapy Selector object. From there on, Scrapy can crawl the page with ease and effectively extract a large amount of data.
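A minimal sketch of that hand-off, using http://quotes.toscrape.com/js/ as a stand-in JavaScript-rendered page:

from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.Chrome()
driver.get('http://quotes.toscrape.com/js/')  # page built by JavaScript

# build a Scrapy Selector from the rendered HTML
sel = Selector(text=driver.page_source)
quotes = sel.css('div.quote span.text::text').getall()
driver.quit()

print(quotes[:2])  # first two extracted quotes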
How do you integrate Selenium with Scrapy?
Integrating scrapy-selenium in a Scrapy project: install scrapy-selenium and add it to your settings.py file; download the Chrome driver according to your Chrome browser version; place the chromedriver in the project, update settings.py, and switch the spider file to SeleniumRequest (see the walkthrough above).
Is Scrapy better than Selenium?
Selenium is an excellent automation tool and Scrapy is by far the most robust web scraping framework. When we consider web scraping, in terms of speed and efficiency Scrapy is a better choice. When dealing with JavaScript-based websites where we need to make AJAX/PJAX requests, Selenium can work better.