• April 13, 2024

Node.js, XPath, HTML

Performant parsing of HTML pages with Node.js and XPath

I’m doing some web scraping with Node.js. I’d like to use XPath because I can generate it semi-automatically with several GUI tools. The problem is that I cannot find a way to do this efficiently.
jsdom is extremely slow. It takes about a minute to parse a 500 KiB file, with full CPU load and a heavy memory footprint.
Popular HTML parsing libraries (e.g. cheerio) neither support XPath nor expose a W3C-compliant DOM.
Efficient HTML parsing is obviously implemented in WebKit, so using PhantomJS or CasperJS would be an option, but those have to be run in a special way, not just with node.
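
To compare candidate parsers on the points above (parse time and memory footprint), a small timing harness may help. The following is only a sketch using Node.js built-ins: `measureParse` and `countParagraphs` are hypothetical names, the input is synthetic, and the regex-based counter is a stand-in that would be replaced by a call into whichever library is being evaluated.

```javascript
// Minimal sketch: time a synchronous parse function and record heap growth.
// All names here are illustrative assumptions, not part of any library.
const { performance } = require('node:perf_hooks');

function measureParse(parse, html) {
  const heapBefore = process.memoryUsage().heapUsed;
  const start = performance.now();
  const result = parse(html);
  const elapsedMs = performance.now() - start;
  // Heap delta is approximate; it can even be negative if GC runs mid-parse.
  const heapDeltaBytes = process.memoryUsage().heapUsed - heapBefore;
  return { result, elapsedMs, heapDeltaBytes };
}

// Synthetic input of roughly 500 KB; replace with the real page contents.
const html = '<html><body>' + '<p>row</p>'.repeat(50000) + '</body></html>';

// Stand-in "parser" that only counts opening <p> tags.
// Swap in the real library call (DOM build plus XPath query) when benchmarking.
const countParagraphs = (source) => (source.match(/<p>/g) || []).length;

const stats = measureParse(countParagraphs, html);
console.log(
  `${stats.result} nodes, ${stats.elapsedMs.toFixed(1)} ms, ` +
  `${stats.heapDeltaBytes} bytes heap delta`
);
```

Running the same harness against each library on the same file gives numbers that can be compared directly, although a single run is noisy and several repetitions would be more reliable.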