• November 20, 2024

Scala Web Scraper

scala-scraper - Scaladex

A library providing a DSL for loading and extracting content from HTML pages.
Take a look at the example sources and the unit specs for usage examples, or keep reading for more thorough documentation. Feel free to use GitHub Issues to submit bug reports or feature requests, and Gitter to ask questions.
This README contains the following sections:
Quick Start
Core Model
Browsers
Content Extraction
Content Validation
Other DSL Features
Using Browser-Specific Features
Working Behind an HTTP/HTTPS Proxy
Integration with Typesafe Config
New Features and Migration Guide
Copyright

Quick Start

To use Scala Scraper in an existing SBT project with Scala 2.11 or newer, add the following dependency to your build.sbt:
libraryDependencies += "net.ruippeixotog" %% "scala-scraper" % "2.2.1"
If you are using an older version of this library, see the documentation for the version you're using: 1.x, 0.1.2, 0.1.1, 0.1.
An implementation of the Browser trait, such as JsoupBrowser, can be used to fetch HTML from the web or to parse a local HTML file or string:
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
val browser = JsoupBrowser()
val doc = browser.parseFile("core/src/test/resources/")
val doc2 = browser.get("…")
The returned object is a Document, which already provides several methods for manipulating and querying HTML elements. For simple use cases, it can be enough. For others, this library improves the content extracting process by providing a powerful DSL.
You can open the file loaded above to follow the examples throughout the README.
First of all, the DSL methods and conversions must be imported:
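import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._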
Content can then be extracted using the >> extraction operator and CSS queries:
// Extract the text inside the element with id "header"
doc >> text("#header")
// res0: String = "Test page h1"
// Extract the <span> elements inside #menu
val items = doc >> elementList("#menu span")
// items: List[Element] = List(
//   JsoupElement(Home),
//   JsoupElement(Section 1),
//   JsoupElement(Section 2),
//   JsoupElement(Section 3)
// )
// From each item, extract all the text inside its <a> elements
items.map(_ >> allText("a"))
// res1: List[String] = List("Home", "Section 1", "", "Section 3")
// From the meta element with "viewport" as its name attribute, extract the
// text in the content attribute
doc >> attr("content")("meta[name=viewport]")
// res2: String = "width=device-width, initial-scale=1"
If the element may or may not be in the page, the >?> operator tries to extract the content and returns it wrapped in an Option:
// Extract the element with id "footer" if it exists, return `None` if it
// doesn't:
doc >?> element("#footer")
// res3: Option[Element] = Some(JsoupElement(…))
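To make the difference between a present and a missing element concrete, here is a minimal sketch that can be pasted into an sbt console with scala-scraper on the classpath. The inline HTML string and the names page, header, footer and footerText are assumptions made up for this illustration; they are not part of the example file used elsewhere in this README.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Hypothetical inline page, used only for this sketch.
val html = """<html><body><h1 id="header">Hello</h1></body></html>"""
val page = JsoupBrowser().parseString(html)

// The element exists, so the extracted text comes back wrapped in Some.
val header: Option[String] = page >?> text("#header")

// No element has id "footer", so the result is None instead of an exception.
val footer: Option[String] = page >?> text("#footer")

// The usual Option combinators apply from here on.
val footerText: String = footer.getOrElse("(no footer)")
```

Using >?> instead of >> is a reasonable default whenever the page layout is outside your control, since a missing element then becomes an ordinary value to handle.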

With only these two operators, some useful things can already be achieved:
// Go to a news website and extract the hyperlink inside the h1 element if it
// exists. Follow that link and print both the article title and its short
// description (inside "…")
for {
  headline <- browser.get("…") >?> element("h1 a")
  headlineDesc = browser.get(headline.attr("href")) >> text("…")
} println("== " + headline.text + " ==\n" + headlineDesc)
The next two sections present the core classes used by this library. They are followed by a description of the full capabilities of the DSL, including parsing content after extraction, validating the contents of a page, and defining custom extractors or validators.

Core Model

The library represents HTML documents and their elements by Document and Element objects, simple interfaces containing methods for retrieving information and navigating through the DOM.
Browser implementations are the entry points for obtaining Document instances. Most notably, they implement get, post, parseFile and parseString methods for retrieving documents from different sources. Depending on the browser used, Document and Element instances may have different semantics, mainly regarding their immutability guarantees.

Browsers

The library currently provides two built-in implementations of Browser:
JsoupBrowser is backed by jsoup, a Java HTML parser library. JsoupBrowser provides powerful and efficient document querying, but it doesn't run JavaScript in the pages. As such, it is limited to working strictly with the HTML sent in the page source;
HtmlUnitBrowser is based on HtmlUnit, a GUI-less browser for Java programs. HtmlUnitBrowser thoroughly simulates a web browser, executing JavaScript code in the pages in addition to parsing HTML. It supports several compatibility modes, allowing it to emulate browsers such as Internet Explorer.
Due to its speed and maturity, JsoupBrowser is the recommended browser to use when JavaScript execution is not needed. More information about each browser and its semantics can be obtained in the Scaladoc of each implementation.
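The recommendation above can be sketched as a small selection helper. This is only an illustration: pickBrowser and its needsJavaScript flag are hypothetical names, and the inline HTML is made up. Only the JsoupBrowser branch is exercised here, since HtmlUnitBrowser is considerably heavier to start.

```scala
import net.ruippeixotog.scalascraper.browser.{Browser, HtmlUnitBrowser, JsoupBrowser}
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Hypothetical helper: use HtmlUnit only when the page needs JavaScript to run,
// and fall back to the faster JsoupBrowser otherwise.
def pickBrowser(needsJavaScript: Boolean): Browser =
  if (needsJavaScript) HtmlUnitBrowser() else JsoupBrowser()

// Static HTML needs no JavaScript execution, so Jsoup is enough here.
val staticBrowser = pickBrowser(needsJavaScript = false)
val staticPage = staticBrowser.parseString(
  "<html><head><title>Sample</title></head><body></body></html>")

val title: String = staticPage >> text("title")
```

Because both implementations share the Browser interface, the extraction code stays the same whichever one is chosen.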

Content Extraction

The >> and >?> operators shown above accept an HtmlExtractor as their right argument, a trait with a very simple interface:
trait HtmlExtractor[-E <: Element, +A] {
  def extract(doc: ElementQuery[E]): A
}
One can always create a custom extractor by implementing HtmlExtractor. However, the DSL provides several ways to create HtmlExtractor instances, which should be enough in most situations. In general, you can use the extractor factory method:
doc >> extractor(<cssQuery>, <contentExtractor>, <contentParser>)
Where the arguments are:
cssQuery: the CSS query used to select the elements to be processed;
contentExtractor: the content to be extracted from the selected elements, e. g. the element objects themselves, their text, a specific attribute, form data;
contentParser: an optional parser for the data extracted in the step above, such as parsing numbers and dates or using regexes.
The DSL provides several contentExtractor and contentParser instances, which were imported before with DSL.Extract._ and DSL.Parse._. The full list can be seen in the ContentExtractors and ContentParsers objects.
Some usage examples:
// Extract the date from the "#date" element
doc >> extractor("#date", text, asLocalDate("yyyy-MM-dd"))
// res5: LocalDate = 2014-10-26
// Extract the text of all "#mytable td" elements and parse each of them as a number
doc >> extractor("#mytable td", texts, seq(asDouble))
// res6: TraversableOnce[Double] = non-empty iterator
// Extract an element "h1" and do no parsing (the default parsing behavior)
doc >> extractor("h1", element, asIs[Element])
// res7: Element = JsoupElement(Test page h1)
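For the rare case where the factory method is not enough, implementing the HtmlExtractor trait shown above could look like the following sketch. CountOf is a hypothetical extractor invented for this illustration, and the inline HTML is made up. The sketch assumes ElementQuery exposes a select method taking a CSS query, and that the trait lives in the scraper package.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.model.{Element, ElementQuery}
import net.ruippeixotog.scalascraper.scraper.HtmlExtractor

// Hypothetical custom extractor: counts the elements matching a CSS query.
case class CountOf(cssQuery: String) extends HtmlExtractor[Element, Int] {
  def extract(doc: ElementQuery[Element]): Int = doc.select(cssQuery).size
}

// Hypothetical inline fragment; jsoup wraps it in <html> and <body>.
val listPage = JsoupBrowser().parseString("<ul><li>a</li><li>b</li><li>c</li></ul>")

// Custom extractors plug into >> exactly like the built-in ones.
val itemCount: Int = listPage >> CountOf("li")
```

Declaring the extractor over Element rather than a browser-specific element type keeps it usable with any Browser implementation, since HtmlExtractor is contravariant in its element type.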
With the help of the implicit conversions provided by the DSL, the most common extraction cases can be written more succinctly:
<cssQuery> is taken as extractor(<cssQuery>, elements, asIs) (by an implicit conversion);
<contentExtractor> is taken as extractor(":root", <contentExtractor>, asIs) (content extractors are also HtmlExtractor instances by themselves);
<contentExtractor>(<cssQuery>) is taken as extractor(<cssQuery>, <contentExtractor>, asIs) (by an implicit conversion).
Because of that, one can write the expressions in the Quick Start section, as well as:
// Extract all the "h3" elements (as a lazy iterable)
doc >> "h3"
// res8: ElementQuery[Element] = LazyElementQuery(
//   JsoupElement(Section 1 h3),
//   JsoupElement(Section 2 h3),
//   JsoupElement(Section 3 h3)
// )
// Extract all text inside this document
doc >> allText
// res9: String = "Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 2014-10-26 2014-10-26T12:30:05Z 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014"
// Extract the elements with class "…"
doc >> elementList("…")
// res10: List[Element] = List(
//   JsoupElement(Section 2)
// )
// Extract the text inside each "p" element
doc >> texts("p")
// res11: Iterable[String] = List(
//   "Some text for testing",
//   "More text for testing"
// )

Content Validation

When scraping web pages, it is common to validate that a page actually has the expected structure. This library provides special support for creating and applying validations.
An HtmlValidator has the following signature:
trait HtmlValidator[-E <: Element, +R] {
  def matches(doc: ElementQuery[E]): Boolean
  def result: Option[R]
}
As with extractors, the DSL provides the validator constructor and the >/~ operator for applying a validation to a document:
doc >/~ validator(<extractor>)(<matcher>)
Where the arguments are:
extractor: an extractor as defined in the previous section;
matcher: a function mapping the extracted content to a boolean indicating if the document is valid.
The result of a validation is an Either[R, A] instance, where A is the type of the document and R is the result type of the validation (which will be explained later).
Some validation examples:
// Check if the title of the page is "Test page"
doc >/~ validator(text("title"))(_ == "Test page")
// res12: Either[Unit, browser.DocumentType] = Right(
//   JsoupDocument(…Test page…)
// )
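Since a validation returns an Either, the result can be handled with ordinary pattern matching. The sketch below assumes a made-up inline page; titledPage, passing, failing and describe are hypothetical names used only for this illustration.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Hypothetical inline page, used only for this sketch.
val titledPage = JsoupBrowser().parseString(
  "<html><head><title>Test page</title></head><body></body></html>")

// The matcher holds, so the document comes back on the Right side.
val passing = titledPage >/~ validator(text("title"))(_ == "Test page")

// The matcher fails, so the validation ends up on the Left side.
val failing = titledPage >/~ validator(text("title"))(_ == "Other page")

// Hypothetical helper turning a validation outcome into a message.
def describe(outcome: Either[_, _]): String = outcome match {
  case Right(_) => "page has the expected structure"
  case Left(_)  => "page failed validation"
}
```

Keeping the validated document on the Right side means further extractions can be chained on it only when the validation succeeded.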