Javascript Web Crawler
is it possible to write web crawler in javascript? – Stack Overflow
I want to crawl a page, check for the hyperlinks on that page, follow those hyperlinks, and capture data from the resulting pages.
asked Jun 18 ’12 at 13:04
Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.
If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).
If you really want to write a fully-featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged Web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)
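For illustration, a rough sketch of such a same-origin crawl, run from the page's own console (the starting path and the decision to follow only same-origin links are arbitrary choices here):

// Minimal same-origin crawl sketch; run in the browser console on the target site.
// The Same-Origin Policy still applies: only pages on this origin can be fetched.
const seen = new Set();

async function crawl(path) {
  if (seen.has(path)) return;
  seen.add(path);

  const res = await fetch(path);            // same-origin Ajax request
  const html = await res.text();
  const doc = new DOMParser().parseFromString(html, 'text/html');

  for (const a of doc.querySelectorAll('a[href]')) {
    const url = new URL(a.getAttribute('href'), location.origin);
    if (url.origin === location.origin) {   // follow same-origin links only
      console.log(url.pathname);            // capture/extract data here
      await crawl(url.pathname);
    }
  }
}

crawl(location.pathname);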
answered Jun 18 ’12 at 13:19
apsillers, 105k, 15 gold badges, 210 silver badges, 231 bronze badges
We can crawl pages using JavaScript from the server side with the help of a headless WebKit browser. For crawling, there are a few libraries such as PhantomJS and CasperJS; there is also a newer wrapper on top of PhantomJS called Nightmare JS which makes the work easier.
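For example, a minimal PhantomJS script (a sketch only; the target URL is a placeholder) that loads a page and lists its hyperlinks could look like this:

// save as list-links.js and run with: phantomjs list-links.js
var page = require('webpage').create();

page.open('http://example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
    return;
  }
  // evaluate() runs inside the page context and returns the href of every anchor
  var links = page.evaluate(function () {
    return Array.prototype.map.call(document.querySelectorAll('a[href]'), function (a) {
      return a.href;
    });
  });
  console.log(links.join('\n'));
  phantom.exit();
});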
answered Mar 30 ’15 at 13:55
Arun, 111, 4 bronze badges
Google's Chrome team released Puppeteer in August 2017, a Node library which provides a high-level API for both headless and non-headless Chrome (headless Chrome has been available since version 59).
It uses an embedded version of Chromium, so it is guaranteed to work out of the box. If you want to use a specific Chrome version, you can do so by launching Puppeteer with an executable path as a parameter, such as:
const browser = await puppeteer.launch({executablePath: '/path/to/Chrome'});
An example of navigating to a webpage and taking a screenshot of it shows how simple it is (taken from the GitHub page):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
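Going a step further, a hedged sketch of collecting the hyperlinks on a page with Puppeteer (the URL and the a[href] selector are placeholder choices) might be:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // gather the href of every anchor on the page
  const links = await page.$$eval('a[href]', anchors => anchors.map(a => a.href));
  console.log(links);
  await browser.close();
})();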
answered Nov 2 ’17 at 17:13
Natan Streppel, 5,592, 6 gold badges, 33 silver badges, 43 bronze badges
There are ways to circumvent the same-origin policy with JS. I wrote a crawler for Facebook that gathered information from the profiles of my friends and my friends' friends and allowed filtering the results by gender, current location, age, marital status (you catch my drift). It was simple. I just ran it from the console. That way your script gets the privilege to make requests on the current domain. You can also make a bookmarklet to run the script from your bookmarks.
Another way is to provide a PHP proxy. Your script will access the proxy on the current domain and request files from another domain with PHP. Just be careful with those: they might get hijacked and used as a public proxy by a 3rd party if you are not careful.
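For instance, assuming a hypothetical /proxy.php endpoint on the current domain that accepts a url parameter and echoes back the fetched HTML, the browser-side usage might be sketched as:

// Hypothetical same-domain proxy endpoint: /proxy.php?url=<encoded target URL>
async function fetchViaProxy(targetUrl) {
  const res = await fetch('/proxy.php?url=' + encodeURIComponent(targetUrl));
  return res.text(); // raw HTML of the foreign page, served from the current origin
}

fetchViaProxy('http://example.com/some/page')
  .then(html => console.log(html.length + ' bytes fetched'));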
Good luck, maybe you make a friend or two in the process like I did:-)
answered May 13 ’14 at 7:48
Tom, 496, 6 silver badges, 15 bronze badges
My typical setup is to use a browser extension with cross-origin privileges set, which injects both the crawler code and jQuery.
Another take on JavaScript crawlers is to use a headless browser like PhantomJS or CasperJS (which boosts Phantom's powers).
answered Oct 29 ’13 at 14:57
Maciej Jankowski, 2,634, 3 gold badges, 23 silver badges, 33 bronze badges
Yes, it is possible.
Use Node.js (it's server-side JS).
There is NPM (a package manager that handles 3rd-party modules) in Node.js.
Use PhantomJS in Node.js (PhantomJS is a third-party module that can crawl through websites); see the sketch below.
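One way to combine the two (assuming the phantomjs binary is installed and on the PATH, and reusing the hypothetical list-links.js script sketched earlier) is to spawn PhantomJS from Node.js as a child process:

// Run a PhantomJS crawl script from Node.js and collect its output.
const { execFile } = require('child_process');

execFile('phantomjs', ['list-links.js'], (err, stdout, stderr) => {
  if (err) {
    console.error('PhantomJS failed:', stderr || err);
    return;
  }
  const links = stdout.trim().split('\n');
  console.log('Found ' + links.length + ' links');
});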
answered Apr 8 ’15 at 14:30
hfarazm, 1,286, 16 silver badges, 21 bronze badges
There is a client-side approach for this, using the Firefox Greasemonkey extension. With Greasemonkey you can create scripts to be executed each time you open specified URLs.
Here is an example. If you have paginated product-list URLs, you can use something like this to open all the product-list pages (execute this manually):
var j = 0;
for (var i = 1; i < 5; i++) {
  setTimeout(function() {
    j = j + 1;
    // open the next product-list page; the base URL here is a placeholder
    window.open('http://www.example.com/products/pages/' + j, '_blank');
  }, 15000 * i);
}
Then you can create a script that opens all the products in new windows for each product-list page, and include that URL pattern in Greasemonkey.
And then a script for each product page to extract the data, call a web service with the data, close the window, and so on.
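A hedged sketch of such a per-product-page userscript (the @match pattern, the field selectors, and the web-service URL are all placeholders) might look like this:

// ==UserScript==
// @name   Product scraper (sketch)
// @match  http://www.example.com/products/*
// ==/UserScript==

// Extract some data from the product page; the selectors below are placeholders.
var titleEl = document.querySelector('h1');
var data = {
  title: titleEl ? titleEl.textContent.trim() : '',
  url: location.href
};

// Post it to a hypothetical same-domain collection endpoint, then close the window.
fetch('http://www.example.com/collect', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(data)
}).then(function () {
  window.close();
});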
answered Sep 22 '15 at 9:42
farhang67, 693, 1 gold badge, 11 silver badges, 25 bronze badges
I made an example JavaScript crawler on GitHub.
It's event-driven and uses an in-memory queue to store all the resources (i.e. URLs).
How to use in your node environment
var Crawler = require('../lib/crawler')
var crawler = new Crawler('http://www.example.com'); // start URL (placeholder)

// crawler.maxDepth = 4;
// crawler.crawlInterval = 10;
// crawler.maxListenerCurrency = 10;
// crawler.redisQueue = true;
crawler.start();
Here I'm just showing you two core methods of a JavaScript crawler.
Crawler.prototype.start = function() {
  var crawler = this;
  process.nextTick(() => {
    //the run loop
    crawler.crawlerIntervalId = setInterval(() => {
      crawler.crawl();
    }, crawler.crawlInterval);
    //kick off first one
    crawler.crawl();
  });
  crawler.running = true;
  crawler.emit('start');
}

Crawler.prototype.crawl = function() {
  var crawler = this;
  if (crawler._openRequests >= crawler.maxListenerCurrency) return;
  //go get the item
  crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
    if (queueItem) {
      //got the item, start the fetch
      crawler.fetchQueueItem(queueItem, index);
    } else if (crawler._openRequests === 0) {
      crawler.queue.complete((err, completeCount) => {
        if (err)
          throw err;
        crawler.queue.getLength((err, length) => {
          if (length === completeCount) {
            //no open requests, no unfetched items: stop the crawler
            crawler.emit("complete", completeCount);
            clearInterval(crawler.crawlerIntervalId);
            crawler.running = false;
          }
        });
      });
    }
  });
};
Here is the GitHub link. It is a JavaScript web crawler written in under 1,000 lines of code.
This should put you on the right track.
answered Oct 28 ’16 at 20:31
Fan Jin, 2,185, 13 silver badges, 25 bronze badges
You can make a web crawler driven from a remote JSON file that opens all links from a page in new tabs as soon as each tab loads, except ones that have already been opened. If you set up a computer with a browser extension running in a basic browser (nothing runs except the web browser and an internet config program) and had it shipped and installed somewhere with good internet, you could make a database of webpages with an old computer. It would just need to retrieve the content of each tab. You could do that for about $2,000, contrary to most estimates for search-engine costs. You'd just need to make your algorithm rank pages based on how often a term appears in the innerText property of the page, the keywords, and the description. You could also set up another PC to recrawl old pages from the one-time database and add more. I'd estimate it would take about 3 months and $20,000, maximum.
answered Dec 5 ’20 at 1:08
Anonymous, 339, 1 silver badge, 11 bronze badges
Building a simple web crawler with Node.js and JavaScript
All of us use search engines almost daily. When most of us talk about search engines, we really mean the World Wide Web search engines. A very superficial overview of a search engine suggests that a user enters a search term and the search engine gives a list of all relevant resources related to that search term. But to provide the user with a list of resources, the search engine has to know that those resources exist on the Internet.
Search engines do not magically know what websites exist on the Internet. So this is exactly where the role of web crawlers comes into the picture.
What is a web crawler?
A web crawler is a bot that downloads and indexes content from all over the Internet. The aim of such a bot is to get a brief idea about or know what every webpage on the Internet is about so that retrieving the information becomes easy when needed.
The web crawler is like a librarian who organizes the books in a library and makes a catalog of those books so that it becomes easier for the reader to find a particular book. To categorize a book, the librarian needs to read its topic, summary, and, if required, some of its content.
How does a web crawler work?
It is very difficult to know how many webpages exist on the Internet in total. A web crawler starts with a certain number of known URLs and as it crawls that webpage, it finds links to other webpages. A web crawler follows certain policies to decide what to crawl and how frequently to crawl.
Which webpages to crawl first is also decided by considering certain parameters. For instance, webpages with a lot of visitors are a good starting point, since a search engine will want to have them indexed.
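As a minimal illustration of that crawl loop (a sketch only: the seed URL and the depth limit are arbitrary, and it uses the request and cheerio modules installed in the next section):

// Start from a seed URL, follow links, and respect a depth limit.
var request = require('request');
var cheerio = require('cheerio');

var visited = new Set();
var queue = [{ url: 'https://example.com/', depth: 0 }];
var maxDepth = 2;

function crawlNext() {
  var item = queue.shift();
  if (!item) return;                                  // queue exhausted
  if (visited.has(item.url) || item.depth > maxDepth) return crawlNext();
  visited.add(item.url);

  request(item.url, function (error, response, body) {
    if (!error && response.statusCode === 200) {
      var $ = cheerio.load(body);
      $('a[href]').each(function () {
        // resolve relative links against the page we just fetched
        var next = new URL($(this).attr('href'), item.url).href;
        queue.push({ url: next, depth: item.depth + 1 });
      });
    }
    crawlNext();                                      // process the next queued URL
  });
}

crawlNext();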
Building a simple web crawler with Node.js and JavaScript
We will be using the modules cheerio and request.
Install these dependencies using the following commands
npm install --save cheerio
npm install --save request
The following code imports the required modules and makes a request to Hacker News.
We log the status code of the response to the console to see if the request was successful.
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');

request("https://news.ycombinator.com/", (error, response, body) => {
  if (error) {
    console.log("Error: " + error);
  }
  console.log("Status code: " + response.statusCode);
});
Note that the fs module is used to handle files and it is a built-in module.
We observe the structure of the data using the developer tools of our browser. We see that there are tr elements with the class athing.
We will go through all these elements and get the title of the post by selecting the a element inside the title child cell, and the hyperlink by reading that a element's href attribute. This task is accomplished by adding the following code after the console.log statement of the previous code block:
var $ = cheerio.load(body);

$('.athing:has(.votelinks)').each(function(index) {
  var title = $(this).find('.title > a').text().trim();
  var link = $(this).find('.title > a').attr('href');
  // append each title and link to the output file (filename is a placeholder)
  fs.appendFileSync('output.txt', title + '\n' + link + '\n');
});
We also skip the posts related to hiring: if observed carefully, we see that the row of such a post does not have a child element with vote links, which is what the :has(.votelinks) filter checks for.
The complete code now looks like the following (the parsing loop from the previous block goes inside the request callback, after the status-code log):

request("https://news.ycombinator.com/", function(error, response, body) {
  if (error) {
    console.log("Error: " + error);
  }
  console.log("Status code: " + response.statusCode);

  var $ = cheerio.load(body);
  $('.athing:has(.votelinks)').each(function(index) {
    var title = $(this).find('.title > a').text().trim();
    var link = $(this).find('.title > a').attr('href');
    fs.appendFileSync('output.txt', title + '\n' + link + '\n');
  });
});
The data is stored in the output text file.
Your simple web crawler is ready!!
Node.js web scraping tutorial – LogRocket Blog
Editor’s note: This web scraping tutorial was last updated on 28 February 2021.
In this web scraping tutorial, we'll demonstrate how to build a web crawler in Node.js to scrape websites and store the retrieved data in a Firebase database. Our web crawler will perform the web scraping and data transfer using Node.js worker threads.
Here’s what we’ll cover:
What is a web crawler?
What is web scraping in Node.js?
Node.js workers: The basics
Communicating with worker threads
Building a web crawler
Web scraping in Node.js
Using worker threads for web scraping
Is web scraping legal?
A web crawler, often shortened to crawler or sometimes called a spider-bot, is a bot that systematically browses the internet typically for the purpose of web indexing. These internet bots can be used by search engines to improve the quality of search results for users.
In addition to indexing the world wide web, crawling can also be used to gather data. This is known as web scraping.
Use cases for web scraping include collecting prices from a retailer’s site or hotel listings from a travel site, scraping email directories for sales leads, and gathering information to train machine learning models.
The process of web scraping can be quite tasking on the CPU, depending on the site's structure and the complexity of the data being extracted. You can use worker threads to optimize the CPU-intensive operations required to perform web scraping in Node.js.
Installation
Launch a terminal and create a new directory for this tutorial:
$ mkdir worker-tutorial
$ cd worker-tutorial
Initialize the directory by running the following command:
$ yarn init -y
We need the following packages to build the crawler:
Axios, a promise-based HTTP client for the browser and Node.js
Cheerio, a lightweight implementation of jQuery which gives us access to the DOM on the server
Firebase database, a cloud-hosted NoSQL database. If you’re not familiar with setting up a Firebase database, check out the documentation and follow steps 1-3 to get started
Let’s install the packages listed above with the following command:
$ yarn add axios cheerio firebase-admin
Before we start building the crawler using workers, let’s go over some basics. You can create a test file in the root of the project to run the following snippets.
Registering a worker
A worker can be initialized (registered) by importing the worker class from the worker_threads module like this:
const { Worker } = require('worker_threads');

new Worker("./worker.js"); // path to the worker script file (placeholder name)
Hello world
Printing out Hello World with workers is as simple as running the snippet below:
const { Worker, isMainThread } = require('worker_threads');

if (isMainThread) {
  new Worker(__filename);
} else {
  console.log("Worker says: Hello World"); // prints 'Worker says: Hello World'
}
This snippet pulls in the worker class and the isMainThread object from the worker_threads module:
isMainThread helps us know when we are either running inside the main thread or a worker thread
new Worker(__filename) registers a new worker with the __filename variable which, in this case, is the file currently being executed
When a new worker thread is spawned, there is a messaging port that allows inter-thread communications. Below is a snippet which shows how to pass messages between workers (threads):
const { Worker, isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
  const worker = new Worker(__filename);
  worker.once('message', (message) => {
    console.log(message); // prints 'Worker thread: Hello!'
  });
  worker.postMessage('Main Thread: Hi!');
} else {
  parentPort.once('message', (message) => {
    console.log(message); // prints 'Main Thread: Hi!'
    parentPort.postMessage("Worker thread: Hello!");
  });
}
In the snippet above, we send a message to the parent thread using parentPort.postMessage() after initializing a worker thread. Then we listen for a message from the parent thread using parentPort.once(). We also send a message to the worker thread using worker.postMessage() and listen for a message from the worker thread using worker.once().
Running the code produces the following output:
Main Thread: Hi!
Worker thread: Hello!
Let’s build a basic web crawler that uses Node workers to crawl and write to a database. The crawler will complete its task in the following order:
Fetch (request) HTML from the website
Extract the HTML from the response
Traverse the DOM and extract the table containing exchange rates
Format table elements (tbody, tr, and td) and extract exchange rate values
Store exchange rate values in an object and send it to a worker thread using worker.postMessage()
Accept the message from the parent thread in the worker thread using parentPort.once()
Store message in Firestore (Firebase database)
Let’s create two new files in our project directory:
main.js for the main thread
dbWorker.js for the worker thread
The source code for this tutorial is available here on GitHub. Feel free to clone it, fork it, or submit an issue.
In the main thread (main.js), we will scrape the IBAN website for the current exchange rates of popular currencies against the US dollar. We will import axios and use it to fetch the HTML from the site using a simple GET request.
We will also use cheerio to traverse the DOM and extract data from the table element. To know the exact elements to extract, we will open the IBAN website in our browser and load dev tools:
From the image above, we can see the table element with the classes — table table-bordered table-hover downloads. This will be a great starting point and we can feed that into our cheerio root element selector:
const axios = require('axios');
const cheerio = require('cheerio');
const url = "https://www.iban.com/exchange-rates";

fetchData(url).then((res) => {
  const html = res.data;
  const $ = cheerio.load(html);
  const statsTable = $('.table.table-bordered.table-hover.downloads > tbody > tr');
  statsTable.each(function() {
    let title = $(this).find('td').text();
    console.log(title);
  });
});

async function fetchData(url) {
  console.log("Crawling data...");
  // make http call to url
  let response = await axios(url).catch((err) => console.log(err));
  if (!response || response.status !== 200) {
    console.log("Error occurred while fetching data");
    return;
  }
  return response;
}
Running the code above with Node will give the following output:
Going forward, we will update the main.js file so that we can properly format our output and send it to our worker thread.
Updating the main thread
To properly format our output, we need to get rid of white space and tabs since we will be storing the final output in JSON. Let's update the main.js file accordingly:
[...]
let workDir = __dirname + "/dbWorker.js";

const mainFunc = async () => {
  // fetch html data from iban website
  let res = await fetchData(url);
  if (!res.data) {
    console.log("Invalid data Obj");
    return;
  }
  let dataObj = new Object();
  // mount html page to the root element
  const $ = cheerio.load(res.data);
  const statsTable = $('.table.table-bordered.table-hover.downloads > tbody > tr');
  //loop through all table rows and get table data
  statsTable.each(function() {
    let title = $(this).find('td').text(); // get the text in all the td elements
    let newStr = title.split("\t"); // convert text (string) into an array
    newStr.shift(); // strip off empty array element at index 0
    formatStr(newStr, dataObj); // format array string and store in an object
  });

  return dataObj;
}

mainFunc().then((res) => {
  // start worker
  const worker = new Worker(workDir);
  console.log("Sending crawled data to dbWorker...");
  // send formatted data to worker thread
  worker.postMessage(res);
  // listen to message from worker thread
  worker.on("message", (message) => {
    console.log(message);
  });
});

function formatStr(arr, dataObj) {
  // regex to match all the words before the first digit
  let regExp = /[^A-Z]*(^\D+)/
  let newArr = arr[0].split(regExp); // split array element 0 using the regExp rule
  dataObj[newArr[1]] = newArr[2]; // store object
}
In the snippet above, we are doing more than data formatting; after the mainFunc() has been resolved, we pass the formatted data to the worker thread for storage.
In this worker thread, we will initialize Firebase and listen for the crawled data from the main thread. When the data arrives, we will store it in the database and send a message back to the main thread to confirm that data storage was successful.
The snippet that takes care of the aforementioned operations can be seen below:
const { parentPort } = require('worker_threads');
const admin = require("firebase-admin");

//firebase credentials
let firebaseConfig = {
  apiKey: "XXXXXXXXXXXX-XXX-XXX",
  authDomain: "XXXXXXXXXXXX-XXX-XXX",
  databaseURL: "XXXXXXXXXXXX-XXX-XXX",
  projectId: "XXXXXXXXXXXX-XXX-XXX",
  storageBucket: "XXXXXXXXXXXX-XXX-XXX",
  messagingSenderId: "XXXXXXXXXXXX-XXX-XXX",
  appId: "XXXXXXXXXXXX-XXX-XXX"
};

// Initialize Firebase
admin.initializeApp(firebaseConfig);
let db = admin.firestore();

// get current date in DD-MM-YYYY format
let date = new Date();
let currDate = `${date.getDate()}-${date.getMonth() + 1}-${date.getFullYear()}`; // getMonth() is zero-based

// receive crawled data from main thread
parentPort.once("message", (message) => {
  console.log("Received data from mainWorker...");
  // store data gotten from main thread in database
  db.collection("Rates").doc(currDate).set({
    rates: JSON.stringify(message)
  }).then(() => {
    // send data back to main thread if operation was successful
    parentPort.postMessage("Data saved successfully");
  })
  .catch((err) => console.log(err));
});
Note: To set up a database on firebase, please visit the firebase documentation and follow steps 1-3 to get started.
Running main.js (which encompasses dbWorker.js) with Node will give the following output:
You can now check your Firebase database, and you will see the following crawled data:
Although web scraping can be fun, it can also be against the law if you use data to commit copyright infringement. It is generally advised that you read the terms and conditions of the site you intend to crawl, to know their data crawling policy beforehand.
You should learn more about web crawling policy before undertaking your own web scraping project.
The use of worker threads does not guarantee your application will be faster, but it can appear that way if used efficiently, because it frees up the main thread by making CPU-intensive tasks less cumbersome on the main thread.
Conclusion
In this tutorial, we learned how to build a web crawler that scrapes currency exchange rates and saves them to a database. We also learned how to use worker threads to run these operations.
The source code for each of the following snippets is available on GitHub. Feel free to clone it, fork it or submit an issue.
Further reading
Interested in learning more about worker threads? You can check out the following links:
Worker threads
Node.js multithreading: What are Worker Threads and why do they matter?
Going Multithread with Node.js
Simple bidirectional messaging in Worker Threads
Frequently Asked Questions about javascript web crawler
What is JavaScript crawler?
The HTML Crawler uses the traditional method of downloading the source HTML and parsing it, without rendering JavaScript. The Chrome Crawler utilises headless Chromium (like Google) to render the page, then parses the rendered HTML.
How do you crawl in JavaScript?
To crawl a JavaScript website, open up the SEO Spider, click ‘Configuration > Spider > Rendering’ and change ‘Rendering’ to ‘JavaScript’.
What is crawler in Nodejs?
Node Web Crawler is a web spider written with Node.js. It gives you the full power of jQuery on the server to parse a big number of pages as they are downloaded, asynchronously. Scraping should be simple and fun!
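Assuming this refers to the crawler package on npm (node-crawler), a hedged usage sketch might be:

// npm install crawler
const Crawler = require('crawler');

const c = new Crawler({
  maxConnections: 10,
  // this callback runs for each crawled page
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$; // server-side jQuery-style API (cheerio) bound to the page
      console.log($('title').text());
    }
    done();
  }
});

c.queue('http://example.com/');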