Jsdom Web Scraping
Web Scraping and Parsing HTML in Node.js with jsdom – Twilio
The internet has a wide variety of information for human consumption. But this data is often difficult to access programmatically if it doesn’t come in the form of a dedicated REST API. With tools like jsdom, you can scrape and parse this data directly from web pages to use for your projects and applications.
Let’s use the example of needing MIDI data to train a neural network that can generate classic Nintendo-sounding music. In order to do this, we’ll need a set of MIDI music from old Nintendo games. Using jsdom we can scrape this data from the Video Game Music Archive.
Getting started and setting up dependencies
Before moving on, you will need to make sure you have an up-to-date version of Node.js and npm installed.
Navigate to the directory where you want this code to live and run the following command in your terminal to create a package for this project:
npm init --yes
The --yes argument runs through all of the prompts that you would otherwise have to fill out or skip. Now we have a package.json for our app.
For making HTTP requests to get data from the web page we will use the Got library, and for parsing through the HTML we'll use jsdom.
Run the following command in your terminal to install these libraries:
npm install got@10.4.0 jsdom@16.2.2
jsdom is a pure-JavaScript implementation of many web standards, making it a familiar tool to use for lots of JavaScript developers. Let’s dive into how to use it.
Using Got to retrieve data to use with jsdom
First let’s write some code to grab the HTML from the web page, and look at how we can start parsing through it. The following code will send a GET request to the web page we want, and will create a jsdom object with the HTML from that page, which we’ll name dom:
const fs = require('fs');
const got = require('got');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const vgmUrl = 'https://www.vgmusic.com/music/console/nintendo/nes';

got(vgmUrl).then(response => {
  const dom = new JSDOM(response.body);
  console.log(dom.window.document.querySelector('title').textContent);
}).catch(err => {
  console.log(err);
});
When you pass the JSDOM constructor a string, you will get back a JSDOM object, from which you can access a number of usable properties such as window. As seen in this code, you can navigate through the HTML and retrieve DOM elements for the data you want using a query selector.
For example, dom.window.document.querySelector('title').textContent will get you the text inside of the <title> tag on the page.
Using CSS Selectors with jsdom
If you want to get more specific in your query, there are a variety of selectors you can use to parse through the HTML. Two of the most common ones are to search for elements by class or ID. If you wanted to get a div with the ID of "menu" you would use querySelector('#menu'), and if you wanted all of the header columns in the table of VGM MIDIs, you'd pass querySelectorAll a selector matching those header cells.
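For instance, here is a quick sketch of both kinds of query. The #menu ID and the table-header selector are illustrative names only, not selectors taken from the VGM page:

// Illustrative selectors; substitute the IDs and classes of the page you're actually scraping.
const menu = dom.window.document.querySelector('#menu');              // one element, looked up by ID
const headerCells = dom.window.document.querySelectorAll('table th'); // every header cell in every table
console.log(headerCells.length);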
What we want on this page are the hyperlinks to all of the MIDI files we need to download. We can start by getting every link on the page using querySelectorAll('a'). Add the following to your code inside the .then callback:
  dom.window.document.querySelectorAll('a').forEach(link => {
    console.log(link.href);
  });
}).catch(err => {
  console.log(err);
});
This code logs the URL of every link on the page. We’re able to look through all elements from a given selector using the forEach function. Iterating through every link on the page is great, but we’re going to need to get a little more specific than that if we want to download all of the MIDI files.
Filtering through HTML elements
Before writing more code to parse the content that we want, let’s first take a look at the HTML that’s rendered by the browser. Every web page is different, and sometimes getting the right data out of them requires a bit of creativity, pattern recognition, and experimentation.
Our goal is to download a bunch of MIDI files, but there are a lot of duplicate tracks on this webpage, as well as remixes of songs. We only want one of each song, and because our ultimate goal is to use this data to train a neural network to generate accurate Nintendo music, we won’t want to train it on user-created remixes.
When you’re writing code to parse through a web page, it’s usually helpful to use the developer tools available to you in most modern browsers. If you right-click on the element you’re interested in, you can inspect the HTML behind that element to get more insight.
You can write filter functions to fine-tune which data you want from your selectors. These are functions which loop through all elements for a given selector and return true or false based on whether they should be included in the set or not.
If you looked through the data that was logged in the previous step, you might have noticed that there are quite a few links on the page that have no href attribute, and therefore lead nowhere. We can be sure those are not the MIDIs we are looking for, so let's write a short function to filter those out, as well as links whose href attribute doesn't point to a .mid file:
const isMidi = (link) => {
  // Return false if there is no href attribute.
  if (typeof link.href === 'undefined') { return false; }

  return link.href.endsWith('.mid');
};
Now we have the problem of not wanting to download duplicates or user generated remixes. For this we can use regular expressions to make sure we are only getting links whose text has no parentheses, as only the duplicates and remixes contain parentheses:
const noParens = (link) => {
  // Regular expression to determine if the text has parentheses.
  const parensRegex = /^((?!\().)*$/;

  return parensRegex.test(link.textContent);
};
Try adding these to your code by creating an array out of the collection of HTML Element Nodes that are returned from querySelectorAll and applying our filter functions to it:
// Create an Array out of the HTML Elements for filtering using spread syntax.
const nodeList = [...dom.window.document.querySelectorAll('a')];

nodeList.filter(isMidi).filter(noParens).forEach(link => {
  console.log(link.href);
});
Run this code again and it should only be printing .mid files, without duplicates of any particular song.
Downloading the MIDI files we want from the webpage
Now that we have working code to iterate through every MIDI file that we want, we have to write code to download all of them.
In the callback function for looping through all of the MIDI links, add this code to stream the MIDI download into a local file, complete with error checking:
  const fileName = link.href;

  got.stream(`${vgmUrl}/${fileName}`)
    .on('error', err => { console.log(err); console.log(`Error on ${vgmUrl}/${fileName}`); })
    .pipe(fs.createWriteStream(`MIDIs/${fileName}`))
    .on('finish', () => console.log(`Downloaded: ${fileName}`));
});
Run this code from a directory where you want to save all of the MIDI files, and watch your terminal screen display all 2230 MIDI files that you downloaded (at the time of writing this). With that, we should be finished scraping all of the MIDI files we need.
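One practical note: the streaming code above writes each file into a MIDIs/ directory, so that folder has to exist before the downloads start. Here is a minimal sketch (not from the original article) that creates it up front:

const fs = require('fs');

// Create the output folder for the downloads if it isn't already there.
if (!fs.existsSync('MIDIs')) {
  fs.mkdirSync('MIDIs');
}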
Go through and listen to them and enjoy some Nintendo music!
The vast expanse of the World Wide Web
Now that you can programmatically grab things from web pages, you have access to a huge source of data for whatever your projects need. One thing to keep in mind is that changes to a web page’s HTML might break your code, so make sure to keep everything up to date if you’re building applications on top of this. You might want to also try comparing the functionality of the jsdom library with other solutions by following tutorials for web scraping using Cheerio and headless browser scripting using Puppeteer or a similar library called Playwright.
If you’re looking for something to do with the data you just grabbed from the Video Game Music Archive, you can try using Python libraries like Magenta to train a neural network with it.
I’m looking forward to seeing what you build. Feel free to reach out and share your experiences or ask any questions.
Email:
Twitter: @Sagnewshreds
Github: Sagnew
Twitch (streaming live code): Sagnewshreds
Web scraping for web developers: a concise summary
Knowing one approach to web scraping may solve your problem in the short term, but all methods have their own strengths and weaknesses. Being aware of this can save you time and help you solve a task more effectively.

Numerous resources exist which will show you a single technique for extracting data from a web page. The reality is that multiple solutions and tools can be used for that. What are your options for programmatically extracting data from a web page? What are the pros and cons of each approach? How can you use cloud services to increase the degree of automation? This guide is meant to answer these questions.

I assume you have a basic understanding of browsers in general, HTTP requests, the DOM (Document Object Model), HTML, CSS selectors, and async JavaScript. If these phrases sound unfamiliar, I suggest checking out those topics before you continue reading. Examples are implemented in Node.js, but hopefully you can transfer the theory into other languages if needed.

Static content

HTML source

Let's start with the simplest approach. If you are planning to scrape a web page, this is the first method to try. It requires a negligible amount of computing power and the least time to implement. However, it only works if the HTML source code contains the data you are targeting.

To check that in Chrome, right-click the page and choose View page source. Now you should see the HTML source code. It's important to note here that you won't see the same code by using Chrome's inspect tool, because it shows the HTML structure related to the current state of the page, which is not necessarily the same as the source HTML document that you can get from the server.

Once you find the data here, write a CSS selector belonging to the wrapping element, to have a reference later on. To implement this, you can send an HTTP GET request to the URL of the page and you will get back the HTML source code. In Node, you can use a tool called CheerioJS to parse this raw HTML and extract the data using a selector. The code looks something like this:

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const url = 'https://example.com';
const selector = '.example';

fetch(url)
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    const data = $(selector);
    console.log(data.text());
  });

Dynamic content

In many cases, you can't access the information from the raw HTML code, because the DOM was manipulated by some JavaScript executed in the background. A typical example of that is a SPA (Single Page Application), where the HTML document contains a minimal amount of information and the JavaScript populates it at runtime. In this situation, a solution is to build the DOM and execute the scripts located in the HTML source code, just like a browser does. After that, the data can be extracted from this object with selectors.

Headless browsers

This can be achieved by using a headless browser. A headless browser is almost the same thing as the normal one you are probably using every day, but without a user interface. It's running in the background and you can programmatically control it instead of clicking with your mouse and typing with a keyboard.

A popular choice for a headless browser is Puppeteer. It is an easy-to-use Node library which provides a high-level API to control Chrome in headless mode. It can be configured to run non-headless, which comes in handy during development. The following code does the same thing as before, but it will work with dynamic pages as well:

const puppeteer = require('puppeteer');
async function getData(url, selector) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const data = await page.evaluate(selector => {
    return document.querySelector(selector).innerText;
  }, selector);
  await browser.close();
  return data;
}
const url = 'https://example.com';
const selector = '.example';

getData(url, selector)
  .then(result => console.log(result));

Of course, you can do more interesting things with Puppeteer, so it is worth checking out the documentation. Here is a code snippet which navigates to a URL, takes a screenshot, and saves it:

const puppeteer = require('puppeteer');
async function takeScreenshot(url, path) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({ path: path });
  await browser.close();
}

const url = 'https://example.com';
const path = 'example.png';

takeScreenshot(url, path);

As you can imagine, running a browser requires much more computing power than sending a simple GET request and parsing the response. Therefore execution is relatively costly and slow. Not only that, but including a browser as a dependency makes the deployment package massive.

On the upside, this method is highly flexible. You can use it for navigating around pages, simulating clicks, mouse moves, and keyboard events, filling out forms, taking screenshots or generating PDFs of pages, executing commands in the console, and selecting elements to extract their text content. Basically, everything that can be done manually in a browser can be done with it.

Building just the DOM

You may think it's a little bit of overkill to simulate a whole browser just for building a DOM. Actually, it is, at least under certain circumstances. There is a Node library called Jsdom which will parse the HTML you pass it, just like a browser does. However, it isn't a browser, but a tool for building a DOM from a given HTML source code, while also executing the JavaScript code within that HTML.

Thanks to this abstraction, Jsdom is able to run faster than a headless browser. If it's faster, why not use it instead of headless browsers all the time? Quote from the documentation:

People often have trouble with asynchronous script loading when using jsdom. Many pages load scripts asynchronously, but there is no way to tell when they're done doing so, and thus when it's a good time to run your code and inspect the resulting DOM structure. This is a fundamental limitation. … This can be worked around by polling for the presence of a specific element.

This solution is shown in the example below. It checks every 100 ms whether the element has appeared, and it times out after 2 seconds. Jsdom also often throws nasty error messages when some browser feature used by the page is not implemented by Jsdom, such as "Error: Not implemented: …" or "Error: Not implemented: window.scrollTo…". This issue can also be solved with some workarounds (virtual consoles).

Generally, it's a lower-level API than Puppeteer, so you need to implement certain things yourself. These things make it a little messier to use, as you will see in the example. Puppeteer solves all of this for you behind the scenes and makes it extremely easy to use. Jsdom, in exchange for this extra work, offers a fast and lean solution.

Let's see the same example as previously, but with Jsdom:

const jsdom = require("jsdom");
const { JSDOM} = jsdom;
async function getData(url, selector, timeout) {
  const virtualConsole = new jsdom.VirtualConsole();
  virtualConsole.sendTo(console, { omitJSDOMErrors: true });
  const dom = await JSDOM.fromURL(url, {
    runScripts: "dangerously",
    resources: "usable",
    virtualConsole
  });
  const data = await new Promise((res, rej) => {
    const started = Date.now();
    const timer = setInterval(() => {
      const element = dom.window.document.querySelector(selector);
      if (element) {
        res(element.textContent);
        clearInterval(timer);
      } else if (Date.now() - started > timeout) {
        rej("Timed out");
        clearInterval(timer);
      }
    }, 100);
  });
  dom.window.close();
  return data;
}

const url = "https://example.com";
const selector = ".example";

getData(url, selector, 2000).then(result => console.log(result));

Reverse engineering

Jsdom is a fast and lightweight solution, but it's possible to simplify things even further. Do we even need to simulate the DOM? Generally speaking, the webpage you want to scrape consists of the same HTML and the same JavaScript, built on technologies you already know. So, if you find the piece of code from which the targeted data was derived, you can repeat the same operation in order to get the same result.

If we oversimplify things, the data you're looking for can be: part of the HTML source code (as we saw in the first section), part of a static file referenced in the HTML document (for example, a string in a JavaScript file), or a response to a network request (for example, some JavaScript code sent an AJAX request to a server, which responded with a JSON string).

All of these data sources can be accessed with network requests. From our perspective, it doesn't matter whether the webpage uses HTTP, WebSockets, or any other communication protocol, because all of them are reproducible in theory. Once you locate the resource housing the data, you can send a similar network request to the same server as the original page does. As a result, you get the response containing the targeted data, which can easily be extracted with regular expressions, string methods, and so on.

In simple words, you can just take the resource where the data is located, instead of processing and loading the whole page. This way, the problem shown in the previous examples can be solved with a single HTTP request instead of controlling a browser or working with a complex JavaScript object.

This solution seems easy in theory, but most of the time it can be really time-consuming to carry out, and it requires some experience of working with web pages and servers. A possible place to start researching is to observe network traffic. A great tool for that is the Network tab in Chrome DevTools. You will see all outgoing requests with their responses (including static files, AJAX requests, and so on), so you can iterate through them and look for the data. This can be even more sluggish if the response is modified by some code before being rendered on the screen. In that case, you have to find that piece of code and understand what's going on.

As you can see, this solution may require much more work than the methods featured so far. On the other hand, once it's implemented, it provides the best performance.

[Chart: required execution time and package size compared to Jsdom and Puppeteer.] These results aren't based on precise measurements and can vary in every situation, but they show well the approximate difference between these techniques.
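To make the reverse-engineering idea more tangible, here is a hedged sketch of calling a hypothetical JSON endpoint directly instead of loading the page that calls it. The URL, headers, and response shape are invented for this example, not taken from the article:

const fetch = require('node-fetch');

// Hypothetical endpoint spotted in the Network tab; replace it with the request you actually observed.
const apiUrl = 'https://example.com/api/items?page=1';

fetch(apiUrl, { headers: { 'Accept': 'application/json' } })
  .then(res => res.json())
  .then(items => {
    // The response shape (an array of objects with a "name" field) is an assumption for this sketch.
    items.forEach(item => console.log(item.name));
  })
  .catch(err => console.error(err));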
Cloud service integration

Let's say you implemented one of the solutions listed so far. One way to execute your script is to power on your computer, open a terminal, and run it manually. This can become annoying and inefficient very quickly, so it would be better if we could just upload the script to a server and have it execute the code on a regular basis, depending on how it's configured.

This can be done by running an actual server and configuring some rules on when to execute the script. Servers shine when you keep observing an element on a page. In other cases, a cloud function is probably a simpler way to go. Cloud functions are basically containers intended to execute the uploaded code when a triggering event occurs. This means you don't have to manage servers; that's done automatically by the cloud provider of your choice. A possible trigger can be a schedule, a network request, and numerous other events.

You can save the collected data in a database, write it to a Google Sheet, or send it in an email. It all depends on your creativity. Popular cloud providers are Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, and all of them have a function service: AWS Lambda, GCP Cloud Functions, and Azure Functions. They offer some amount of free usage every month, which your single script probably won't exceed except in extreme cases, but please check the pricing before use.

If you are using Puppeteer, Google Cloud Functions is the simplest solution. Headless Chrome's zipped package size (~130 MB) exceeds AWS Lambda's limit on maximum zipped package size (50 MB). There are some techniques to make it work with Lambda, but GCP functions support headless Chrome by default; you just need to include Puppeteer as a dependency in package.json (a rough sketch of such a function follows the summary below).

If you want to learn more about cloud functions in general, do some research on serverless architectures. Many great guides have already been written on this topic, and most providers have an easy-to-follow tutorial.

Summary

I know that every topic was a bit compressed. You probably can't implement every solution with just this knowledge, but with the documentation and some custom research, it shouldn't be a problem. Hopefully you now have a high-level overview of the techniques used for collecting data from the web, so you can dive deeper into each topic accordingly.
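Picking up the cloud-function thread from above, the following is a rough, hedged sketch of what an HTTP-triggered function running Puppeteer on GCP Cloud Functions might look like. The exported name, target URL, and launch flags are illustrative assumptions, not details from the article:

const puppeteer = require('puppeteer');

// Hypothetical HTTP-triggered Cloud Function (Express-style req/res signature).
exports.scrapeTitle = async (req, res) => {
  // '--no-sandbox' is commonly needed in restricted container environments; verify for your setup.
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com'); // placeholder target page
    const title = await page.title();       // grab something simple to return
    res.status(200).send(title);
  } finally {
    await browser.close();
  }
};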
Simple Site Scraping With NodeJS And JSDom – Shane Reustle
Shane Reustle
I’ve been playing with Node on and off over the past couple of weeks and it’s really starting to grow on me. I initially looked into it because I’m intrigued by the thought of using one language for both client and server side coding. Turns out, as people have pointed out, it’s fast too. Really fast. I spent some time messing around with the hello world examples, built some simple APIs, and even gave a talk at BarCamp Boston about the basics of Node, but I want to do something that takes advantage of the JS nature of Node. Let’s start off with a simple site scraping example where we pull the current temperature. As the examples get more complex, we’ll be able to leverage libraries like jQuery to do more complex scraping in an already familiar syntax.
For this first example, you’re going to need the Node packages Request and JSDOM. You can get both of these using npm (npm install request jsdom). This is a pretty short example (9 lines), so I’ll skip right to the code.
var request = require('request');
var jsdom = require('jsdom');

var req_url = ''; // URL of the weather page to scrape (omitted in this excerpt)

request({uri: req_url}, function(error, response, body){
  if(!error && response.statusCode == 200){
    var window = jsdom.jsdom(body).createWindow();
    var temp = window.document.getElementsByClassName('u-eng')[0].innerHTML;
    console.log(temp);
  }
});
We started off by requiring Request and JSDOM. We then made the request to the site we’re going to scrape and set a callback function to handle the response. Inside that callback, we make sure the request was successful by checking the HTTP status code. If the request was successful, we pipe the response into JSDOM to render a duplicate version of the DOM locally so that we can interact with it. Now that we have a local copy of the webpage, we can do whatever we want with it. We only need 1 line of code to extract the current temperature from the page, which we send back to the console for the user to see.
After playing around with this script for a while, you may have noticed this method works well but often throws strange JS errors depending on what site you try to scrape. Let's take a step back and think about what we're doing. We make a request to a page and parse a copy of its response HTML locally. The problem with this method is that resources are usually requested using relative links (/static/…) rather than absolute links. Since these resources do not exist locally, they cannot be loaded, which ends up causing errors later on in the page. We can manually patch up these broken links by either modifying the links in the response before parsing it, or including the scripts in your new DOM before parsing the response. Keep in mind, there may be AJAX requests that use relative links as well, so keep an eye on the network traffic.
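With current versions of jsdom (a newer API than the one used in this older post), one hedged way to sidestep the relative-link problem is to tell jsdom the page's original URL when building the DOM, so relative references resolve against it rather than against a nonexistent local path. A minimal sketch, assuming body and req_url come from the request callback above:

const { JSDOM } = require('jsdom');

// Passing the page's own URL as the document URL makes relative hrefs and
// resource references resolve against it instead of about:blank.
const dom = new JSDOM(body, { url: req_url });
console.log(dom.window.document.querySelector('a').href); // now an absolute URL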
This should give you a good head start on scraping sites with NodeJS and JSDOM. JSDOM does a great job with this task, but doesn't seem to be built for this type of work. If you need to scrape some JS-generated content, you may need to do some extra work. If you have a large site scraping project, you may want to check out NodeIO and PhantomJS. NodeIO is a screen scraping framework built on top of Node, and PhantomJS is a headless implementation of WebKit with a JS API. I would use PhantomJS if I needed to do any large scraping projects because it lets you interact with a real browser which renders all of the JS content. Keep an eye out for a review of PhantomJS in the future.
Frequently Asked Questions about jsdom web scraping
Is web scraping legal?
It is perfectly legal if you scrape data from websites for public consumption and use it for analysis. However, it is not legal if you scrape confidential information for profit. For example, scraping private contact information without permission and selling it to a third party for profit is illegal.
What is Jsdom?
JSDOM is a library which parses and interacts with assembled HTML just like a browser. The benefit is that it isn't actually a browser. Instead, it implements web standards like browsers do. You can feed it some HTML, and it will parse that HTML.
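To make that concrete, here is a tiny sketch of feeding jsdom a string of HTML and querying the resulting DOM:

const { JSDOM } = require('jsdom');

// Parse a snippet of HTML and query it the same way you would in a browser.
const dom = new JSDOM('<p class="greeting">Hello, jsdom!</p>');
console.log(dom.window.document.querySelector('.greeting').textContent); // "Hello, jsdom!"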
Can you use JavaScript to web scrape?
With Node.js, JavaScript is a great language to use for a web scraper: not only is Node fast, but you'll likely end up using a lot of the same methods you're used to from querying the DOM with front-end JavaScript.