• April 30, 2024

Cheerio Node Js Example

Cheerio tutorial – web scraping in JavaScript with … – ZetCode

last modified July 7, 2020
Cheerio tutorial shows how to do web scraping in JavaScript with Cheerio
module. Cheerio implements the core of jQuery designed for the server.
Cheerio
Cheerio is a fast, flexible, and lean implementation of core
jQuery designed specifically for the server.
In this tutorial we scrape HTML from a local web server. For the local
web server, we use the local-web-server.






Home page



My website

I am a JavaScript programmer.

My hobbies are:

  • Swimming
  • Tai Chi
  • Running
  • Web development
  • Reading
  • Music




We will be working with this HTML file.
Cheerio selectors
In Cheerio, we use selectors to select tags of an HTML document.
The selector syntax was borrowed from jQuery.
The following is a partial list of available selectors:
$("*") - selects all elements
$("#first") - selects the element with id="first"
$(".intro") - selects all elements with class="intro"
$("div") - selects all <div> elements
$("h2, div, p") - selects all <h2>, <div>, <p> elements
$("li:first") - selects the first <li> element
$("li:last") - selects the last <li> element
$("li:even") - selects all even <li> elements
$("li:odd") - selects all odd <li> elements
$(":empty") - selects all elements that are empty
$(":focus") - selects the element that currently has focus
    Installing Cheerio and other modules
    We install the cheerio module and two additional modules.
    $ nodejs -v
    v9.11.2
    We use Node version 9.11.2.
    $ sudo npm i cheerio
    $ sudo npm i request
    $ sudo npm i -g local-web-server
    We install cheerio, request, and local-web-server.
    $ ws
    Serving at t400:8000,,
    Inside the project directory, where we have the HTML file, we start
    the local web server. It automatically serves the file on three
    different locations.
    Cheerio title
    In the first example, we get the title of the document.
    const cheerio = require('cheerio');
    const request = require('request');

    request({
        method: 'GET',
        url: 'http://localhost:8000'
    }, (err, res, body) => {

        if (err) return console.error(err);

        let $ = cheerio.load(body);

        let title = $('title');

        console.log(title.text());
    });
    The example prints the title of the HTML document.
    We include the cheerio and request modules. With cheerio, we do web
    scraping. With request, we create GET requests.
    We create a GET request to localhost, which is served by our local
    web server. The resource is available in the body parameter.
    First, we load the HTML document with cheerio.load(). To mimic
    jQuery, we use the $ variable.
    The $('title') selector returns the title tag.
    console.log(title.text());
    With the text() method, we get the text of the title tag.
    $ node
    Home page
    The example prints the title of the document.
    Cheerio get parent element
    The parent element is retrieved with parent().
    // inside the request callback, after let $ = cheerio.load(body);
    let h1El = $('h1');
    let parentEl = h1El.parent();
    console.log(parentEl.get(0).tagName);
    We get the parent of the h1 element.
    main
    The parent element of h1 is main.
    Cheerio first & last element
    The first element of a cheerio object can be found with first(),
    the last element with last().
    // inside the request callback, after let $ = cheerio.load(body);
    let main = $('main');
    let fel = main.children().first();
    let lel = main.children().last();
    console.log(fel.get(0).tagName);
    console.log(lel.get(0).tagName);
    The example prints the first and last element of the main
    tag.
    We select the main tag.
    We get the first and the last element from the main children.
    We find out the tag names.
    h1
    ul
    The first tag of the main is h1, the last
    one is ul.
    Cheerio add element
    The append() method adds a new element at the end
    of the specified tag.
    // inside the request callback, after let $ = cheerio.load(body);
    let ulEl = $('ul');
    ulEl.append('<li>Travel</li>');
    let lis = $('ul').html();
    let items = lis.split('\n');
    items.forEach((e) => {
        if (e) {
            console.log(e.replace(/(\s+)/g, ''));
        }
    });
    In the example, we add a new list item to the ul element and
    print it to the console.
    We append a new hobby.
    We get the HTML of the ul tag.
    console.log(e.replace(/(\s+)/g, ''));
    We strip white spaces. Text data of elements contains lots of
    space.
    <li>Swimming</li>
    <li>TaiChi</li>
    <li>Running</li>
    <li>Webdevelopment</li>
    <li>Reading</li>
    <li>Music</li>
    <li>Travel</li>
    A new travel hobby was appended at the end of the list.
    Cheerio insert after element
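    With after(), we can insert an element after a tag.
    // inside the request callback, after let $ = cheerio.load(body);
    $('main').after('<footer>This is a footer</footer>');
    console.log($.html());
    In the example, we insert a footer element after
    the main element.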
    Cheerio loop over elements
    With each(), we can loop over elements.
    // inside the request callback, after let $ = cheerio.load(body);
    let hobbies = [];
    $('li').each(function (i, e) {
        hobbies[i] = $(this).text();
    });
    console.log(hobbies);
    The example loops over li tags of the ul
    and prints the text of the elements in an array.
    [ 'Swimming',
      'Tai Chi',
      'Running',
      'Web development',
      'Reading',
      'Music' ]
    This is the output.
    Cheerio get element attributes
    Attributes can be retrieved with the attr() function.
    // inside the request callback, after let $ = cheerio.load(body);
    let fpEl = $('h1 + p');
    let attrs = fpEl.attr();
    console.log(attrs);
    In the example, we get the attributes of the paragraph that is
    the immediate sibling of h1.
    { class: 'fpar' }
    The paragraph contains the fpar class.
    Cheerio filter elements
    We can use filter() to apply a filter on the elements.
    // inside the request callback, after let $ = cheerio.load(body);
    let allEls = $('*');
    let filteredEls = allEls.filter(function (i, el) {
        // this === el
        return $(this).children().length > 3;
    });
    let items = filteredEls.get();
    items.forEach(e => {
        console.log(e.name);
    });
    In the example, we find out all elements of the document that contain
    more than three children.
    The * selector selects all elements.
    On the retrieved elements, we apply a filter. An element is included
    in the filtered list only if it contains more than three children.
    console.log(e.name);
    We go through the filtered list and print the names of the elements.
    head
    main
    ul
    The head, main, and ul elements
    contain more than three children. The body is not included
    because it contains only one immediate child.
    In this tutorial, we have done web scraping in JavaScript with
    Cheerio library.
    cheerio

    Fast, flexible & lean implementation of core jQuery designed specifically for the server.
    const cheerio = require('cheerio');
    const $ = cheerio.load('<h2 class="title">Hello world</h2>');

    $('h2.title').text('Hello there!');
    $('h2').addClass('welcome');

    $.html();
    //=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>
    Note
    We are currently working on the 1.0.0 release of cheerio on the main branch. The source code for the last published version, 0.22.0, can be found here.
    Installation
    npm install cheerio
    Features
    ❤ Familiar syntax:
    Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
    ϟ Blazingly fast:
    Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient.
    ❁ Incredibly flexible:
    Cheerio wraps around parse5 parser and can optionally use @FB55’s forgiving htmlparser2. Cheerio can parse nearly any HTML or XML document.
    Cheerio is not a web browser
    Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript which is common for a SPA (single page application). This makes Cheerio much, much faster than other solutions. If your use case requires any of this functionality, you should consider browser automation software like Puppeteer and Playwright or DOM emulation projects like JSDom.
    API
    Markup example we’ll be using:
    <ul id="fruits">
      <li class="apple">Apple</li>
      <li class="orange">Orange</li>
      <li class="pear">Pear</li>
    </ul>
    This is the HTML markup we will be using in all of the API examples.
    Loading
    First you need to load in the HTML. This step in jQuery is implicit, since jQuery operates on the one, baked-in DOM. With Cheerio, we need to pass in the HTML document.
    This is the preferred method:
    // ES6 or TypeScript:
    import * as cheerio from 'cheerio';

    // In other environments:
    const cheerio = require('cheerio');

    const $ = cheerio.load('<ul id="fruits">...</ul>');

    $.html();
    //=> <html><head></head><body><ul id="fruits">...</ul></body></html>
    Similar to web browser contexts, load will introduce <html>, <head>, and <body> elements if they are not already present. You can set load’s third argument to false to disable this.
    const $ = cheerio.load('<ul id="fruits">...</ul>', null, false);

    $.html();
    //=> '<ul id="fruits">...</ul>'


    Optionally, you can also load in the HTML by passing the string as the context:
    $('ul', '<ul id="fruits">...</ul>');
    Or as the root:
    $('li', 'ul', '<ul id="fruits">...</ul>');
    If you need to modify parsing options for XML input, you may pass an extra
    object to .load():
    const $ = cheerio.load('<ul id="fruits">...</ul>', {
      xml: {
        normalizeWhitespace: true,
      },
    });
    The options in the xml object are taken directly from htmlparser2, therefore any options that can be used in htmlparser2 are valid in cheerio as well. When xml is set, the default options are:
    {
      xmlMode: true,
      decodeEntities: true, // Decode HTML entities.
      withStartIndices: false, // Add a `startIndex` property to nodes.
      withEndIndices: false, // Add an `endIndex` property to nodes.
    }
    For a full list of options and their effects, see domhandler and
    htmlparser2’s options.
    Using htmlparser2
    Cheerio ships with two parsers, parse5 and htmlparser2. The
    former is the default for HTML, the latter the default for XML.
    Some users may wish to parse markup with the htmlparser2 library, and
    traverse/manipulate the resulting structure with Cheerio. This may be the case
    for those upgrading from pre-1.0 releases of Cheerio (which relied on
    htmlparser2), for those dealing with invalid markup (because htmlparser2 is
    more forgiving), or for those operating in performance-critical situations
    (because htmlparser2 may be faster in some cases). Note that “more forgiving”
    means htmlparser2 has error-correcting mechanisms that aren’t always a match
    for the standards observed by web browsers. This behavior may be useful when
    parsing non-HTML content.
    To support these cases, load also accepts a htmlparser2-compatible data
    structure as its first argument. Users may install htmlparser2, use it to
    parse input, and pass the result to load:
    // Usage as of htmlparser2 version 6:
    const htmlparser2 = require('htmlparser2');
    const dom = htmlparser2.parseDocument(document, options);

    const $ = cheerio.load(dom);
    If you want to save some bytes, you can use Cheerio’s slim export, which
    always uses htmlparser2:
    const cheerio = require('cheerio/lib/slim');
    Selectors
    Cheerio’s selector implementation is nearly identical to jQuery’s, so the API is very similar.
    $( selector, [context], [root])
    selector searches within the context scope which searches within the root scope. selector and context can be a string expression, DOM Element, array of DOM elements, or cheerio object. root is typically the HTML document string.
    This selector method is the starting point for traversing and manipulating the document. Like jQuery, it’s the primary method for selecting elements in the document.
    $('.apple', '#fruits').text();
    //=> Apple

    $('ul .pear').attr('class');
    //=> pear

    $('li[class=orange]').html();
    //=> Orange
    XML Namespaces
    You can select with XML Namespaces but due to the CSS specification, the colon (:) needs to be escaped for the selector to be valid.
    $('[xml\\:id="main"]');
    Rendering
    When you’re ready to render the document, you can call the html method on the “root” selection:
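    $.root().html();
    //=>  <html>
    //      <head></head>
    //      <body>
    //        <ul id="fruits">
    //          <li class="apple">Apple</li>
    //          <li class="orange">Orange</li>
    //          <li class="pear">Pear</li>
    //        </ul>
    //      </body>
    //    </html>
    If you want to render the outerHTML of a selection, you can use the html utility function:
    cheerio.html($('.pear'));
    //=> <li class="pear">Pear</li>
    You may also render the text content of a Cheerio object using the text static method:
    const $ = cheerio.load('This is content.');
    cheerio.text($('body'));
    //=> This is content.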
    Plugins
    Once you have loaded a document, you may extend the prototype or the equivalent fn property with custom plugin methods:
    const $ = cheerio.load('Hello, world!');
    $.prototype.logHtml = function () {
      console.log(this.html());
    };

    $('body').logHtml(); // logs "Hello, world!" to the console
    If you’re using TypeScript, you should add a type definition for your new method:
    declare module 'cheerio' {
      interface Cheerio<T> {
        logHtml(this: Cheerio<T>): void;
      }
    }
    The “DOM Node” object
    Cheerio collections are made up of objects that bear some resemblance to browser-based DOM nodes. You can expect them to define the following properties:
    tagName
    parentNode
    previousSibling
    nextSibling
    nodeValue
    firstChild
    childNodes
    lastChild
    Screencasts
    This video tutorial is a follow-up to Nettut’s “How to Scrape Web Pages with Node.js and jQuery”, using cheerio instead of JSDOM + jQuery. This video shows how easy it is to use cheerio and how much faster cheerio is than JSDOM + jQuery.
    Cheerio in the real world
    Are you using cheerio in production? Add it to the wiki!
    Sponsors
    Does your company use Cheerio in production? Please consider sponsoring this project! Your help will allow maintainers to dedicate more time and resources to its development and support.
    Backers
    Become a backer to show your support for Cheerio and help us maintain and improve this open source project.
    Special Thanks
    This library stands on the shoulders of some incredible developers. A special thanks to:
    • @FB55 for node-htmlparser2 & CSSSelect:
    Felix has a knack for writing speedy parsing engines. He completely re-wrote both @tautologistic’s node-htmlparser and @harry’s node-soupselect from the ground up, making both of them much faster and more flexible. Cheerio would not be possible without his foundational work
    • @jQuery team for jQuery:
    The core API is the best of its class and despite dealing with all the browser inconsistencies the code base is extremely clean and easy to follow. Much of cheerio’s implementation and documentation is from jQuery. Thanks guys.
    • @visionmedia:
    The style, the structure, the open-source”-ness” of this library comes from studying TJ’s style and using many of his libraries. This dude consistently pumps out high-quality libraries and has always been more than willing to help or answer questions. You rock TJ.
    License
    MIT
    How to Scrape Websites with Node.js and Cheerio

    There might be times when a website has data you want to analyze but the site doesn’t expose an API for accessing those data.
    To get the data, you’ll have to resort to web scraping.
    In this article, I’ll go over how to scrape websites with Node.js and Cheerio.
    Before we start, you should be aware that there are some legal and ethical issues you should consider before scraping a site. It’s your responsibility to make sure that it’s okay to scrape a site before doing so.
    The sites used in the examples throughout this article all allow scraping, so feel free to follow along.
    Prerequisites
    Here are some things you’ll need for this tutorial:
    You need to have Node.js installed. If you don’t have Node, just make sure you download it for your system from the downloads page
    You need to have a text editor like VSCode or Atom installed on your machine
    You should have at least a basic understanding of JavaScript, Node.js, and the Document Object Model (DOM). But you can still follow along even if you are a total beginner with these technologies. Feel free to ask questions on the freeCodeCamp forum if you get stuck
    What is Web Scraping?
    Web scraping is the process of extracting data from a web page. Though you can do web scraping manually, the term usually refers to automated data extraction from websites – Wikipedia.
    What is Cheerio?
    Cheerio is a tool for parsing HTML and XML in Node.js, and is very popular with over 23k stars on GitHub.
    It is fast, flexible, and easy to use. Since it implements a subset of jQuery, it’s easy to start using Cheerio if you’re already familiar with jQuery.
    According to the documentation, Cheerio parses markup and provides an API for manipulating the resulting data structure but does not interpret the result like a web browser.
    The major difference between cheerio and a web browser is that cheerio does not produce visual rendering, load CSS, load external resources or execute JavaScript. It simply parses markup and provides an API for manipulating the resulting data structure. That explains why it is also very fast – cheerio documentation.
    If you want to use cheerio for scraping a web page, you need to first fetch the markup using packages like axios or node-fetch among others.
    How to Scrape a Web Page in Node Using Cheerio
    In this section, you will learn how to scrape a web page using cheerio. It is important to point out that before scraping a website, make sure you have permission to do so – or you might find yourself violating terms of service, breaching copyright, or violating privacy.
    In this example, we will scrape the ISO 3166-1 alpha-3 codes for all countries and other jurisdictions as listed on this Wikipedia page. It is under the Current codes section of the ISO 3166-1 alpha-3 page.
    This is what the list of countries/jurisdictions and their corresponding codes look like:
    You can follow the steps below to scrape the data in the above list.
    Step 1 – Create a Working Directory
    In this step, you will create a directory for your project by running the command below on the terminal. The command will create a directory called learn-cheerio. You can give it a different name if you wish.
    mkdir learn-cheerio
    You should be able to see a folder named learn-cheerio created after successfully running the above command.
    In the next step, you will open the directory you have just created in your favorite text editor and initialize the project.
    Step 2 – Initialize the Project
    In this step, you will navigate to your project directory and initialize the project. Open the directory you created in the previous step in your favorite text editor and initialize the project by running the command below.
    npm init -y
    Successfully running the above command will create a package.json file at the root of your project directory.
    In the next step, you will install project dependencies.
    Step 3 – Install Dependencies
    In this step, you will install project dependencies by running the command below. This will take a couple of minutes, so just be patient.
    npm i axios cheerio pretty
    Successfully running the above command will register three dependencies in the package.json file under the dependencies field. The first dependency is axios, the second is cheerio, and the third is pretty.
    axios is a very popular HTTP client which works in Node and in the browser. We need it because cheerio is only a markup parser and does not fetch pages itself.
    For cheerio to parse the markup and scrape the data you need, we need to use axios for fetching the markup from the website. You can use another HTTP client to fetch the markup if you wish. It doesn’t necessarily have to be axios.
    pretty is an npm package for beautifying the markup so that it is readable when printed on the terminal.
    In the next section, you will inspect the markup you will scrape data from.
    Step 4 – Inspect the Web Page You Want to Scrape
    Before you scrape data from a web page, it is very important to understand the HTML structure of the page.
    In this step, you will inspect the HTML structure of the web page you are going to scrape data from.
    Navigate to the ISO 3166-1 alpha-3 codes page on Wikipedia. Under the “Current codes” section, there is a list of countries and their corresponding codes. You can open the DevTools by pressing the key combination CTRL + SHIFT + I in Chrome, or by right-clicking and selecting the “Inspect” option.
    This is what the list looks like for me in chrome DevTools:
    In the next section, you will write code for scraping the web page.
    Step 5 – Write the Code to Scrape the Data
    In this section, you will write code for scraping the data we are interested in. Start by running the command below which will create the file.
    touch
    Successfully running the above command will create an file at the root of the project directory.
    Like any other Node package, you must first require axios, cheerio, and pretty before you start using them. You can do so by adding the code below at the top of the file you have just created.
    const axios = require("axios");
    const cheerio = require("cheerio");
    const pretty = require("pretty");
    Before we write code for scraping our data, we need to learn the basics of cheerio. We’ll parse the markup below and try manipulating the resulting data structure. This will help us learn cheerio syntax and its most common methods.
    The markup below is the ul element containing our li elements.
    const markup = `
    <ul>
      <li class="fruits__mango">Mango</li>
      <li class="fruits__apple">Apple</li>
    </ul>
    `;
    Add the above variable declaration to the file
    How to Load Markup in Cheerio
    You can load markup in cheerio using the cheerio.load method. The method takes the markup as an argument. It also takes two more optional arguments. You can read more about them in the documentation if you are interested.
    Below, we are passing the first and the only required argument and storing the returned value in the $ variable. We are using the $ variable because of cheerio’s similarity to jQuery. You can use a different variable name if you wish.
    Add the code below to your file:
    const $ = cheerio.load(markup);
    console.log(pretty($.html()));
    If you now run the code in your file with the node command on the terminal, you should be able to see the markup on the terminal. This is what I see on my terminal:
    How to Select an Element in Cheerio
    Cheerio supports most of the common CSS selectors such as the class, id, and element selectors among others. In the code below, we are selecting the element with class fruits__mango and then logging its text to the console. Add the code below to your file.
    const mango = $(".fruits__mango");
    console.log(mango.text()); // Mango
    The above lines of code will log the text Mango on the terminal when you run the file with the node command.
    How to Get the Attribute of an Element in Cheerio
    You can also select an element and get a specific attribute such as the class, id, or all the attributes and their corresponding values.
    const apple = $(".fruits__apple");
    console.log(apple.attr("class")); // fruits__apple
    The above code will log fruits__apple on the terminal. fruits__apple is the class of the selected element.
    How to Loop Through a List of Elements in Cheerio
    Cheerio provides the .each method for looping through several selected elements.
    Below, we are selecting all the li elements and looping through them using the .each method. We log the text content of each list item on the terminal.
    Add the code below to your file.
    const listItems = $("li");
    console.log(listItems.length); // 2
    listItems.each(function (idx, el) {
      console.log($(el).text());
    });
    // Mango
    // Apple
    The above code will log 2, which is the length of the list items, and the text Mango and Apple on the terminal after executing the code.
    How to Append or Prepend an Element to a Markup in Cheerio
    Cheerio provides the append and prepend methods for adding an element to a markup.
    The append method will add the element passed as an argument after the last child of the selected element. On the other hand, prepend will add the passed element before the first child of the selected element.
    const ul = $("ul");
    ul.append("<li>Banana</li>");
    ul.prepend("<li>Pineapple</li>");
    After appending and prepending elements to the markup, this is what I see when I log $.html() on the terminal:
    Those are the basics of cheerio that can get you started with web scraping.
    To scrape the data we described at the beginning of this article from Wikipedia, copy and paste the code below in the file:
    // Loading the dependencies. We don't need pretty
    // because we shall not log html to the terminal
    const axios = require("axios");
    const cheerio = require("cheerio");
    const fs = require("fs");
    // URL of the page we want to scrape
    const url = "";
    // Async function which scrapes the data
    async function scrapeData() {
      try {
        // Fetch HTML of the page we want to scrape
        const { data } = await axios.get(url);
        // Load HTML we fetched in the previous line
        const $ = cheerio.load(data);
        // Select all the list items in plainlist class
        const listItems = $(".plainlist ul li");
        // Stores data for all countries
        const countries = [];
        // Use .each method to loop through the li we selected
        listItems.each((idx, el) => {
          // Object holding data for each country/jurisdiction
          const country = { name: "", iso3: "" };
          // Select the text content of a and span elements
          // Store the textcontent in the above object
          country.name = $(el).children("a").text();
          country.iso3 = $(el).children("span").text();
          // Populate countries array with country data
          countries.push(country);
        });
        // Logs countries array to the console
        console.log(countries);
        // Write countries array in file
        fs.writeFile("", JSON.stringify(countries, null, 2), (err) => {
          if (err) {
            console.error(err);
            return;
          }
          console.log("Successfully written data to file");
        });
      } catch (err) {
        console.error(err);
      }
    }
    // Invoke the above function
    scrapeData();
    Do you understand what is happening by reading the code? If not, I’ll go into some detail now. I have also made comments on each line of code to help you understand.
    In the above code, we require all the dependencies at the top of the file and then we declared the scrapeData function. Inside the function, the markup is fetched using axios. The fetched HTML of the page we need to scrape is then loaded in cheerio.
    The list of countries/jurisdictions and their corresponding iso3 codes are nested in a div element with a class of plainlist. The li elements are selected and then we loop through them using the .each method. The data for each country is scraped and stored in an array.
    After running the code above using the command node, the scraped data is written to the file and printed on the terminal. This is part of what I see on my terminal:
    Conclusion
    Thank you for reading this article and reaching the end! We have covered the basics of web scraping using cheerio. You can head over to the cheerio documentation if you want to dive deeper and fully understand how it works.
    Feel free to ask questions on the freeCodeCamp forum if there is anything you don’t understand in this article.
    Finally, remember to consider the ethical concerns as you learn web scraping.

    Frequently Asked Questions about cheerio node js example

    What is Cheerio in NodeJS?

    Cheerio is a Node.js library that helps developers interpret and analyze web pages using a jQuery-like syntax. (Nov 24, 2018)

    Is Cheerio good for web scraping?

    According to the documentation, Cheerio parses markup and provides an API for manipulating the resulting data structure but does not interpret the result like a web browser. … If you want to use cheerio for scraping a web page, you need to first fetch the markup using packages like axios or node-fetch among others. (Jul 19, 2021)

    How do I scrape data from a website using node JS?

    Steps required for web scraping (Oct 27, 2020):
    1. Create the package.json file.
    2. Install and call the required libraries.
    3. Select the website and data needed to scrape.
    4. Set the URL and check the response code.
    5. Inspect and find the proper HTML tags.
    6. Include the HTML tags in our code.
    7. Cross-check the scraped data.
