Beautiful Soup Vs Scrapy
When should you use Scrapy over BeautifulSoup? | Hexfox
If you’ve found yourself here, you’re probably trying to retrieve some data off
the web to solve a problem – whatever it may be. And if you’re asking this
particular question, then this process is probably familiar to you:
You started by furiously Googling how to do it.
And then discovered it is called “web scraping”.
You then read or got told that Python is a great language to scrape in.
So, armed with that knowledge, you looked for the next step: working out what
the best method is! And that’s when things fall apart, with plenty of people who
claim to know best clamouring to tell you which software to use to reach your goal.
I’m Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king.
You started searching for a solution and yet you’ve ended up with more
freaking problems. With online help, you’ve narrowed it down to these 2
apparently competing routes that are Scrapy & BeautifulSoup but now you’re not
sure which one would be best to learn. Fear not – I will help you make this
decision, right now.
Background
First we need a little background so that we can understand the differences
between them: Scrapy is a fully fledged solution which
allows people to write small amounts of Python code to create a “spider” – an
automated bot which can trawl web pages and scrape them.
BeautifulSoup, on the other
hand, is a helpful utility that allows a programmer to get specific elements out
of a webpage (for example, a list of images). As such, BeautifulSoup alone is
not enough because you have to actually get the webpage in the first place, and
this leads people to use something like requests or urllib2 (urllib.request in
Python 3) to do that part. These tools operate kind of like a web browser and
retrieve pages off the internet so that BeautifulSoup can pluck out the bits a
person is after.
So the difference between the two is actually quite large: Scrapy is a tool
specifically created for downloading, cleaning and saving data from the web and
will help you end-to-end; whereas BeautifulSoup is a smaller package which will
only help you get information out of webpages.
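To make that division of labour concrete, here is a minimal sketch, assuming Python 3, the beautifulsoup4 package and the standard library's urllib.request for the download half. The function names and the sample HTML are made up for illustration; the download function is defined but not called, so the example runs offline.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text (the requests/urllib half)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def image_sources(html: str) -> list[str]:
    """Pluck the src of every <img> tag out of the HTML (the BeautifulSoup half)."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img") if img.get("src")]


# Hypothetical page content, inlined so the example runs without network access.
SAMPLE_HTML = """
<html><body>
  <img src="/static/logo.png" alt="logo">
  <p>Some text</p>
  <img src="/static/cat.jpg">
  <img alt="no source attribute">
</body></html>
"""

images = image_sources(SAMPLE_HTML)
print(images)
```

In a real script you would pass the result of fetch_html(some_url) to image_sources; everything else (retries, rate limiting, saving results) would still be yours to write.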
So when should I use what?
We’re here to get stuff done. If you are a beginner to the world of web
scraping, I have no hesitation in saying that you should take the Scrapy route.
Scrapy will solve numerous problems for you that you would otherwise have to
handle yourself; in fact I’m willing to bet that it will solve problems for you
that you don’t even know you have yet. “Like what!? ”, you say? Let’s have a
look:
Scrapy enables you to easily post-process any data you find. Data on the web
is a mess! It is very unlikely that the data you find will be in the exact
format that you would like it to be: it may have extra line breaks; funky
styling; extra commas in random places; or simply be in all upper case. Scrapy
will let you handle these cases in a straightforward fashion.
Data can often be incomplete in the wild – if you are writing your own script
you will have to try doubly hard to ensure it is resilient to these cases.
Scrapy will make the process of working around incomplete data much easier for
you.
You will often find when scraping that web pages just blow up in your face:
pages won’t be found, servers will have errors or you could have internet
connectivity issues halfway through a large scrape. Scrapy lets you handle
errors gracefully and even has a built-in ability to resume a scrape from the
last page it encountered. You get all this for free.
Some websites will be behind a login wall. Scrapy has built-in form handling
which you can set up to log in to the websites before beginning your scrape.
As a tool built specifically for the task of web scraping, Scrapy provides the
building blocks you need to write sensible spiders. What are sensible spiders?
Spiders that require a minimum amount of maintenance. Individual websites
change their design and layouts on a frequent basis and as we rely on the
layout of the page to extract the data we want – this causes us headaches.
Scrapy separates out the logic so that a simple change in layout doesn’t
result in us having to rewrite our spider from scratch.
Scraping can cause issues for the sites you are targeting; for example,
fetching too many pages at once can put a strain on the target server and take
it offline. This will inevitably result in your spider getting banned for
abuse – so it’s best to be a good citizen on the web. Scrapy allows you to be
one by enabling you to easily throttle the rate at which you are scraping.
Scrapy can do multiple requests at the same time which allows scraping runs to
be much faster. If you are writing a Python script from scratch that tries to
do that, you will likely find that things can go wrong in a million horrible
ways. Scrapy has years of use in large organisations behind it, which helps it avoid these pitfalls.
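Several of the points above (throttling, concurrency, retries, resuming) come down to a handful of Scrapy settings. A minimal sketch of a project's settings.py follows. The setting names are Scrapy's own, but every value and the JOBDIR path are illustrative assumptions rather than recommendations.

```python
# settings.py (sketch): values are illustrative assumptions, tune them per site.

# Politeness: wait between requests and let AutoThrottle adapt to server latency.
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
ROBOTSTXT_OBEY = True

# Concurrency: how many requests may be in flight at once.
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Resilience: retry failed requests a few times before giving up.
RETRY_ENABLED = True
RETRY_TIMES = 3

# Resumability: persist crawl state so an interrupted crawl can pick up again.
JOBDIR = "crawls/example-run"  # hypothetical directory name
```

Doing the same by hand with requests and BeautifulSoup would mean writing the sleep logic, the retry loop and the saved crawl state yourself.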
You will see many people recommending other solutions, and they work! But what I
am saying is: they will likely take more effort and thus it will take you longer
to get what you want done. On top of that, you should also be wary of people
suggesting things because:
they likely do not know your full situation and the future plans of the
project you are about to embark upon. What starts as a 20 line script rarely
stays that way!
they do not know your level of knowledge.
people forget easily just how hard it was to learn the thing they are
suggesting.
developers in particular love to suggest solutions to problems that they
themselves would find interesting to solve. This often means not using an “off
the shelf” solution when that would be a much faster way to get the result you
are after. Remember, we only care about the end result here.
When should I use BeautifulSoup & Requests then?
If you know that you won’t need any of the above or any scraping guidance in
general then they are fantastic tools that offer a lot of freedom. For one-off
scripts that you don’t plan to maintain in the long run, they are likely the
better solution.
This is the first post of many on the topic of web scraping; thanks for reading.
Difference between BeautifulSoup and Scrapy crawler?
I want to make a website that shows a comparison between Amazon and eBay product prices.
Which of these will work better and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.
asked Oct 30 ’13 at 15:43
Nishant Bhakta
Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints such as how many URLs you want to crawl and fetch. It is a complete framework for web scraping or crawling.
BeautifulSoup, by contrast, is a parsing library. It does not fetch anything itself: paired with a library such as requests, it handles the contents of the one URL you give it, lets you parse certain parts of them without any hassle, and then stops. It does not crawl unless you manually put it inside a loop with certain criteria.
In simple words, with Beautiful Soup you can build something similar to Scrapy.
Beautiful Soup is a library while Scrapy is a complete framework.
answered Nov 1 ’13 at 19:41
I think both are good… I’m doing a project right now that uses both. First I scrape all the pages using Scrapy and save them to a MongoDB collection using its pipelines, also downloading the images that exist on the page.
After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.
If you don’t know which product pages you want, Scrapy will be a good tool, since you can use its crawlers to run through the whole Amazon/eBay website looking for the products without writing an explicit for loop.
Take a look at the Scrapy documentation; it’s very simple to use.
answered Oct 30 ’13 at 15:49
rdenadai
Scrapy
It is a web scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.
Feed exports: It allows us to save data in various formats like CSV, JSON, JSON Lines and XML.
Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, with each request processed in a non-blocking way (basically, we don’t have to wait for a request to finish before sending another request).
Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage, like a heading, a certain div with a class name, etc. Scrapy uses lxml for parsing, which is generally much faster than Beautiful Soup.
Setting proxy, user agent, headers etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.
Item pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to a MySQL server.
Cookies: Scrapy automatically handles cookies for us.
etc.
TLDR: Scrapy is a framework that provides everything that one might
need to build large-scale crawls. It provides various features that
hide the complexity of crawling the web. One can simply start writing web
crawlers without worrying about the setup burden.
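As an illustration of the item pipeline idea from the list above, here is a minimal sketch of a pipeline that tidies string fields. An item pipeline is a plain class with a process_item(item, spider) method; the class name, the cleaning rule and the sample item are assumptions made up for this example.

```python
class CleanTextPipeline:
    """Item pipeline sketch: tidy every string field of a scraped item."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() for each item a spider yields.
        for key, value in item.items():
            if isinstance(value, str):
                # Collapse line breaks and runs of whitespace into single spaces.
                item[key] = " ".join(value.split())
        return item


# Exercising the pipeline by hand with a made-up item; no crawl is needed.
raw_item = {"title": "  BIG\n  SALE \t today ", "price": 19.99}
clean_item = CleanTextPipeline().process_item(raw_item, spider=None)
print(clean_item)
```

In a Scrapy project the class would be switched on through the ITEM_PIPELINES setting, and a second pipeline could then write the cleaned item to a database.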
Beautiful soup
Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and has been around for a long time. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests, urllib etc. to make crawlers with BS4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML etc. If you want to speed things up, you will have to use other libraries like multiprocessing.
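To give a feel for the bookkeeping described above, here is a rough sketch of a hand-rolled crawl loop around Beautiful Soup. The crawl function, the injected fetch callable and the three-page fake site are all hypothetical; a real script would plug in a download function built on requests or urllib and would still need cookies, proxies and export code on top.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) must return HTML text or raise OSError."""
    to_visit = deque([start_url])   # URLs waiting to be crawled
    seen = {start_url}              # URLs already queued, to avoid loops
    visited = []                    # URLs fetched successfully, in order

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # error handling is also our job: skip pages that fail
        visited.append(url)

        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return visited


# A made-up three-page site, served from a dict so the sketch runs offline.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.test/b": '<a href="/missing">broken</a>',
}


def fake_fetch(url):
    if url not in FAKE_SITE:
        raise OSError(f"cannot fetch {url}")
    return FAKE_SITE[url]


pages = crawl("http://example.test/", fake_fetch)
print(pages)
```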
To sum up.
Scrapy is a rich framework that you can use to start writing crawlers
without any hassle.
Beautiful Soup is a library that you can use to parse a webpage. It
cannot be used alone to scrape the web.
You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (cron jobs, or Celery for scheduling crawls) and update the prices in the database. That way your website will always pull from the database, and the crawler and database will act as individual components.
answered Aug 13 ’19 at 9:28
Amit
Both are used to parse data.
Scrapy:
Scrapy is a fast, high-level web crawling and web scraping framework,
used to crawl websites and extract structured data from their pages.
But it has some limitations when data comes from JavaScript or
loads dynamically; we can overcome this by using packages like Splash,
Selenium etc.
BeautifulSoup:
Beautiful Soup is a Python library for pulling data out of HTML and
XML files.
It does not execute JavaScript itself, so for dynamically loading pages
it needs HTML that a tool like Selenium has already rendered.
Scrapy with BeautifulSoup is one of the best combos we can work with for scraping static and dynamic content.
answered Oct 6 ’17 at 8:54
The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.
The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.
answered Jun 4 ’14 at 14:45
Using Scrapy you can save tons of code and start with structured programming. If you don’t like any of Scrapy’s pre-written methods, then BeautifulSoup can be used in place of the Scrapy method.
Big projects take advantage of both.
answered Mar 9 ’18 at 6:20
BeautifulSoup is a library that lets you extract information from a web page.
Scrapy on the other hand is a framework, which does the above thing and many more things you probably need in your scraping project like pipelines for saving data.
answered Sep 25 ’19 at 21:01
BeautifulSoup is a small web scraping library. It does the job, but sometimes it does not satisfy your needs: if you scrape websites with a large amount of data, BeautifulSoup on its own falls short.
In that case you should use Scrapy, which is a complete scraping framework that will do the job.
Scrapy can also push data to databases (of all kinds) through its pipelines, so that is a huge
advantage of Scrapy over other web scraping libraries.
answered Mar 4 at 17:21
The differences are many and selection of any tool/technology depends on individual needs.
A few major differences are:
BeautifulSoup is comparatively easier to learn than Scrapy.
The extensions, support, community is larger for Scrapy than for BeautifulSoup.
Scrapy should be considered as a Spider while BeautifulSoup is a Parser.
answered Feb 23 ’19 at 6:26
krish___na
Scrapy VS Beautiful Soup: A Comparison Of Web Crawling …
One of the most critical assets for data-driven organisations is the kind of tools used by their data science professionals. Web crawlers and other web scraping tools are among the tools used to gain meaningful insights. Web scraping allows efficient extraction of data from several web services and helps in converting raw and unstructured data into a structured whole.
There are several tools available for web scraping, such as lxml, BeautifulSoup, MechanicalSoup, Scrapy, Python Requests and others. Among these, Scrapy and Beautiful Soup are popular among developers.
In this article, we will compare these two web scraping tools, and try to understand the differences between them. Before diving deep into the tools, let us first understand what these tools are.
Scrapy
Scrapy is an open-source and collaborative framework for extracting the data you need from websites in a fast and simple manner. This tool can be used for extracting data using APIs. It can also be used as a general-purpose web crawler. Thus, Scrapy is an application framework, which can be used for writing web spiders that crawl websites and extract data from them.
The framework provides a built-in mechanism for extracting data, known as selectors, and can be used for data mining, automated testing, etc. Scrapy runs on Python 3.5+ under CPython, and on PyPy starting with PyPy 5.9.
Features of Scrapy:
Scrapy provides built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions
An interactive shell console for trying out the CSS and XPath expressions to scrape data
Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
Scraping With Scrapy
Using pip
If you just want to install Scrapy globally on your system, you can install the Scrapy library using the Python package installer pip. Open your terminal or command prompt and type the following command.
pip install scrapy
Using Conda
If you want scrapy to be in your conda environment just type in and execute the following command in your terminal
conda install -c conda-forge scrapy
The Scrapy shell: It allows you to scrape web pages interactively using the command line.
To open the Scrapy shell, type scrapy shell.
Scraping with Scrapy Shell
Follow the steps below to start scraping:
1. Open the html file in a web browser and copy the url.
2. Now in the scrapy shell type and execute the following command:
fetch("url")
Replace url with the URL of the HTML file or any webpage, and the fetch command will download the page locally to your system.
You will get a similar message in your console
[] DEBUG: Crawled (200)
3. Viewing the response
The fetch command stores whatever page or information it fetched in a response object. To view the response object, simply type in and enter the following command.
view(response)
The console will return a True and the webpage that was downloaded with fetch() will open up in your default browser.
4. Now all the data you need is available locally. You just need to know what data you need.
5. Scraping the data: Coming back to the console, print the elements behind the webpage that was fetched earlier. Enter the following command:
print()
Beautiful Soup
Beautiful Soup is one of the most popular Python libraries which helps in parsing HTML or XML documents into a tree structure to find and extract data. This tool features a simple, Pythonic interface and automatic encoding conversion to make it easy to work with website data.
This library provides simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree, and automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
Features of Beautiful Soup:
This Python library provides a few simple methods, as well as Pythonic idioms for navigating, searching, and modifying a parse tree
The library automatically converts incoming and outgoing documents to Unicode and UTF-8, respectively
This library sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility
Scraping With Beautifulsoup
Installing Beautiful Soup 4
The Beautiful Soup library can be installed using pip with a very simple command. It is available on almost all platforms. Here is a way to install it using Jupyter Notebook.
The library can be imported with the following code and assigned to an object.
Getting Started
We will be using this basic, and default, HTML doc to parse the data using Beautiful Soup.
The following code will expand HTML into its hierarchy:
Exploring The Parse Tree
To navigate through the tree, we can use the following commands:
Beautiful Soup has many attributes which can be accessed and edited. This extracted parsed data can be saved onto a text file.
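A minimal sketch of the steps above (import, parse, expand the hierarchy, navigate), assuming the library was installed with pip install beautifulsoup4. The HTML document here is a small made-up one, not the document the article originally used.

```python
from bs4 import BeautifulSoup

# A small made-up document standing in for the page being parsed.
html_doc = """
<html><head><title>Price list</title></head>
<body>
  <h1 class="heading">Today's prices</h1>
  <ul>
    <li class="item">Tea</li>
    <li class="item">Coffee</li>
  </ul>
</body></html>
"""

# Parse the markup into a tree using Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the tree as indented text, one tag per line.
print(soup.prettify())

# Navigating and searching the tree.
title_text = str(soup.title.string)   # text inside <title>
heading_class = soup.h1["class"]      # attribute access; class comes back as a list
items = [li.get_text() for li in soup.find_all("li", class_="item")]

print(title_text, heading_class, items)
```

In a Jupyter notebook the install step would typically be run in its own cell before the import.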
Scrapy VS Beautiful Soup
Structure
Scrapy is an open-source framework, whereas Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. A framework inverts the control of the program: it calls the developer’s code and dictates what the developer needs to supply. In the case of a library, the developer calls the library where and when they need it.
Performance
Scrapy can be said to be faster than Beautiful Soup, largely because it sends requests asynchronously instead of waiting for each page in turn; its built-in support for feed exports and for selecting and extracting data saves development effort rather than run time. Work with Beautiful Soup can be sped up with the help of multithreading.
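One way the multithreading suggestion could look in practice is a thread pool that downloads and parses several pages at once. This is a sketch using the standard library's ThreadPoolExecutor; the fetch function and the pages are stand-ins made up for the example, and in a real script the gain would come mostly from overlapping the network waits.

```python
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup

# Made-up pages keyed by URL, so the sketch runs without network access.
FAKE_PAGES = {
    "http://example.test/1": "<html><head><title>One</title></head></html>",
    "http://example.test/2": "<html><head><title>Two</title></head></html>",
    "http://example.test/3": "<html><head><title>Three</title></head></html>",
}


def fetch(url):
    """Stand-in for a real download (e.g. via requests or urllib)."""
    return FAKE_PAGES[url]


def page_title(url):
    """Download one page and return the text of its <title> tag."""
    soup = BeautifulSoup(fetch(url), "html.parser")
    return str(soup.title.string)


urls = list(FAKE_PAGES)

# Each worker thread handles one URL at a time; map() keeps the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    titles = list(pool.map(page_title, urls))

print(titles)
```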
Extensibility
Beautiful Soup works best when working on smaller projects. On the other hand, Scrapy may be the better choice for larger projects with more complexities, as this framework can add custom functionalities and can develop pipelines with flexibility and speed.
Beginner-Friendly
For a beginner who is trying their hand at web scraping for the first time, Beautiful Soup is the best choice to start with. Scrapy can be used for scraping too, but it is comparatively more complex.
Community
The developer community of Scrapy is stronger and larger than that of Beautiful Soup. Also, developers can use Beautiful Soup for parsing HTML responses in Scrapy callbacks by feeding the response’s body into a BeautifulSoup object and extracting whatever data they need from it.
Ambika Choudhury
A technical journalist who loves writing about machine learning and artificial intelligence. A lover of music, writing and learning something out of the box.
Frequently Asked Questions about beautiful soup vs scrapy
Is Scrapy better than BeautifulSoup?
Performance. Scrapy can be said to be faster than Beautiful Soup, largely because it sends requests asynchronously. Work with Beautiful Soup can be sped up with the help of multithreading. (Apr 8, 2020)
Is Scrapy faster than Selenium?
Before coding, you need to estimate the size of the data to be extracted and the number of URLs to visit. Scrapy only visits the URLs you tell it to, but Selenium controls a browser that fetches every JS, CSS and image file to render the page, which is why Selenium is much slower than Scrapy when crawling. (Jan 2, 2021)
Is BeautifulSoup faster than Selenium?
One of the ways to compare Selenium vs BeautifulSoup is the performance of both. … This is a con of BeautifulSoup because the programmer needs to know multithreading properly. Scrapy is faster than both as it makes use of asynchronous system calls. (Feb 10, 2021)