The summary text for this Location page is simply the first paragraph tag within article#wikiArticle:
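As a minimal sketch, assuming doc still holds the parsed Location page from earlier in the tutorial:

@ doc.select("article#wikiArticle > p").first.text // the first-paragraph summary of the Location interface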
And the name and text for each property and method are within a <dl> definition list, as pairs of <dt> and <dd> tags:

@ doc.select("article#wikiArticle dl dt").asScala
res: Buffer[Element] = ArrayBuffer(
  <dt>…Location.href…</dt>,
  <dt>…Location.protocol…</dt>,
  …
)
And then use the .nextElementSibling of each <dt> to find the <dd> tag containing the description:

@ val nameDescPairs = doc.select("article#wikiArticle dl dt").asScala.map(element => (element, element.nextElementSibling))
nameDescPairs: Buffer[(org.jsoup.nodes.Element, org.jsoup.nodes.Element)] = ArrayBuffer(
  (
    <dt>…Location.href…</dt>,
    <dd>Is a DOMString containing the entire URL. If changed, the associated document navigates to the new page. It can be set from a different origin than the associated document.</dd>
  ),
  …
)
And finally calling .text to get the raw text of each:

@ val textPairs = nameDescPairs.map{case (k, v) => (k.text, v.text)}
textPairs: Buffer[(String, String)] = ArrayBuffer(
  (
    "Location.href",
    "Is a DOMString containing the entire URL. It can be set from a different origin than the associated document."
  ),
  (
    "Location.protocol",
    "Is a DOMString containing the protocol scheme of the URL, including the final ':'."
  ),
  (
    "Location.host",
    "Is a DOMString containing the host, that is the hostname, a ':', and the port of the URL."
  ),
  …
)
Putting it Together
We now have snippets of code that let us scrape the index of Web APIs from MDN, giving us a list of every API documented:
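As a recap, a minimal sketch of that index scrape, assuming the jsoup import from earlier and that the index page keeps its layout, with the interface links in a div.index following the h2#Interfaces heading:

@ val indexDoc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API").get()

@ val links = indexDoc.select("h2#Interfaces").nextAll.select("div.index a").asScala

@ val linkData = links.map(link => (link.attr("href"), link.attr("title"), link.text))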
linkData: Buffer[(String, String, String)] = ArrayBuffer(
  ("/en-US/docs/Web/API/AbortController", …, "AbortController"),
  …
)
And also to scrape the page of each individual API, fetching summary documentation for both that API and each individual property and method:
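Again as a sketch for a single page, here Location, matching the error-handling-free snippets from earlier:

@ val doc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API/Location").get()

@ val summary = doc.select("article#wikiArticle > p").first.text

@ val methodsAndProperties = doc.select("article#wikiArticle dl dt").asScala
    .map(elem => (elem.text, elem.nextElementSibling.text))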
methodsAndProperties: Buffer[(String, String)] = ArrayBuffer(
  ("Location.protocol", "Is a DOMString containing the protocol scheme of the URL, including the final ':'."),
  …
)
We can thus combine both of these together into a single piece of code that will loop over the pages linked from the index, and then fetch the summary paragraph and method/property details for each documented API:
@ {
    import $ivy.`org.jsoup:jsoup:1.13.1`
    import org.jsoup._, collection.JavaConverters._

    val indexDoc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API").get()
    val links = indexDoc.select("h2#Interfaces").nextAll.select("div.index a").asScala
    val linkData = links.map(link => (link.attr("href"), link.attr("title"), link.text))

    val articles = for ((url, tooltip, name) <- linkData) yield {
      println("Scraping " + name)
      val doc = Jsoup.connect("https://developer.mozilla.org" + url).get()
      val summary = doc.select("article#wikiArticle > p").asScala.headOption.fold("")(_.text)
      val methodsAndProperties = doc
        .select("article#wikiArticle dl dt")
        .asScala
        .map(elem => (elem.text, elem.nextElementSibling match {case null => "" case x => x.text}))
      (url, tooltip, name, summary, methodsAndProperties)
    }
  }
Note that I added a bit of error handling in here: rather than fetching the summary text via .text, I use .asScala.headOption.fold("")(_.text) to account for the possibility that there is no summary paragraph. Similarly, I check .nextElementSibling to see if it is null before calling .text to fetch its contents. Other than that, it is essentially the same as the snippets we saw earlier.
This should take a few minutes to run, as it has to fetch every page individually to parse and extract the data we want. After it’s done, we should see the following output:
articles: Buffer[(String, String, String, String, Buffer[(String, String)])] = ArrayBuffer(
  (
    "/en-US/docs/Web/API/ANGLE_instanced_arrays",
    …,
    "ANGLE_instanced_arrays",
    …,
    ArrayBuffer(
      (
        "ext.VERTEX_ATTRIB_ARRAY_DIVISOR_ANGLE",
        "Returns a GLint describing the frequency divisor used for instanced rendering when used in the gl.getVertexAttrib() as the pname parameter."
      ),
      (
        "ext.drawArraysInstancedANGLE()",
        "Behaves identically to gl.drawArrays() except that multiple instances of the range of elements are executed, and the instance advances for each iteration."
      ),
      …
articles contains the first-paragraph summary of every documentation page, along with the documentation for each of its methods and properties. We can see how many articles we have scraped in total:

@ articles.length
res60: Int = 917
As well as how many method and property documentation snippets we have fetched:

@ articles.map(_._5.length).sum
res61: Int = 16583
Lastly, if we need to use this information elsewhere, it is easy to dump to a JSON file that can be accessed from some other process:
@ os.write(os.pwd / "docs.json", upickle.default.write(articles, indent = 4))

@ os.read(os.pwd / "docs.json")
res65: String = """[
    [
        "/en-US/docs/Web/API/ANGLE_instanced_arrays",
        …
        [
            [
                "ext.VERTEX_ATTRIB_ARRAY_DIVISOR_ANGLE",
                "Returns a GLint describing the frequency divisor used for instanced rendering when used in the gl.getVertexAttrib() as the pname parameter."
            ],
            [
                "ext.drawArraysInstancedANGLE()",
                "Behaves identically to gl.drawArrays() …"
            ],
            [
                "ext.drawElementsInstancedANGLE()",
                …
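And to sanity-check the dump from Scala itself, a quick sketch using the ujson library that Ammonite bundles, reading back the docs.json file written above:

@ val loaded = ujson.read(os.read(os.pwd / "docs.json"))

@ loaded.arr.length // one entry per scraped article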
Conclusion
In this tutorial, we have walked through the basics of using the Scala programming language and the Jsoup HTML parser to scrape semi-structured data off of human-readable HTML pages: specifically taking the well-known MDN Web API Documentation, and extracting summary documentation for every interface, method and property documented within it. We did so interactively in the REPL, and were able to see immediately what our code did to extract information from the semi-structured human-readable HTML. For re-usability, you may want to place the code in a Scala Script, or for more sophisticated scrapers use a proper build tool like Mill.
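For example, a minimal sketch of such a script; the file name, entry-point name, and output path here are illustrative, not from the code above:

// scrape.sc — run with: amm scrape.sc
import $ivy.`org.jsoup:jsoup:1.13.1`
import org.jsoup._, collection.JavaConverters._

@main
def run() = {
  val indexDoc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API").get()
  val links = indexDoc.select("h2#Interfaces").nextAll.select("div.index a").asScala
  val linkData = links.map(link => (link.attr("href"), link.attr("title"), link.text))
  // ...the per-page scraping loop from above would go here...
  os.write.over(os.pwd / "index.json", upickle.default.write(linkData, indent = 4))
}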
This post only covers the basics of web scraping: for websites that need user accounts and logins, you may need to use Requests-Scala to go through the HTTP signup flow and maintain a user session. For websites that require Javascript to function, you may need a more fully-featured browser automation tool like Selenium. Nevertheless, this should be enough to get you started with the basic concepts of fetching, navigating and scraping HTML websites using the Jsoup library.

About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the contents on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming