The summary text for this Location page is simply the first paragraph tag within article#wikiArticle:
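As a minimal sketch, assuming doc still holds the parsed Location page from earlier in the tutorial:

@ doc.select("article#wikiArticle > p").first.text // the first-paragraph summary of the Location interface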
And the name and text for each property and method are within a <dl> definition list, as pairs of <dt> and <dd> tags:

@ doc.select("article#wikiArticle dl dt").asScala
res: Buffer[Element] = ArrayBuffer(
  <dt>…Location.href…</dt>,
  <dt>…Location.protocol…</dt>,
  …
)
And then use the .nextElementSibling of each <dt> to find the <dd> tag containing the description:

@ val nameDescPairs = doc.select("article#wikiArticle dl dt").asScala.map(element => (element, element.nextElementSibling))
nameDescPairs: Buffer[(org.jsoup.nodes.Element, org.jsoup.nodes.Element)] = ArrayBuffer(
  (
    <dt>…Location.href…</dt>,
    <dd>Is a DOMString containing the entire URL. If changed, the associated document navigates to the new page. It can be set from a different origin than the associated document.</dd>
  ),
  …
)
And finally calling .text to get the raw text of each:

@ val textPairs = nameDescPairs.map{case (k, v) => (k.text, v.text)}
textPairs: Buffer[(String, String)] = ArrayBuffer(
  (
    "Location.href",
    "Is a DOMString containing the entire URL. It can be set from a different origin than the associated document."
  ),
  (
    "Location.protocol",
    "Is a DOMString containing the protocol scheme of the URL, including the final ':'."
  ),
  (
    "Location.host",
    "Is a DOMString containing the host, that is the hostname, a ':', and the port of the URL."
  ),
  …
)
Putting it Together
We now have snippets of code that let us scrape the index of Web APIs from MDN, giving us a list of every API documented:
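As a recap, a minimal sketch of that index scrape, assuming the jsoup import from earlier and that the index page keeps its layout, with the interface links in a div.index following the h2#Interfaces heading:

@ val indexDoc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API").get()

@ val links = indexDoc.select("h2#Interfaces").nextAll.select("div.index a").asScala

@ val linkData = links.map(link => (link.attr("href"), link.attr("title"), link.text))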
linkData: Buffer[(String, String, String)] = ArrayBuffer(
  ("/en-US/docs/Web/API/AbortController", …, "AbortController"),
  …
)
And also to scrape the page of each individual API, fetching summary documentation for both that API and each individual property and method:
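Again as a sketch for a single page, here Location, matching the error-handling-free snippets from earlier:

@ val doc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API/Location").get()

@ val summary = doc.select("article#wikiArticle > p").first.text

@ val methodsAndProperties = doc.select("article#wikiArticle dl dt").asScala
    .map(elem => (elem.text, elem.nextElementSibling.text))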
methodsAndProperties: Buffer[(String, String)] = ArrayBuffer(
  ("Location.protocol", "Is a DOMString containing the protocol scheme of the URL, including the final ':'."),
  …
)
We can thus combine both of these together into a single piece of code that will loop over the pages linked from the index, and then fetch the summary paragraph and method/property details for each documented API:
@ {
    import $ivy.`org.jsoup:jsoup:1.13.1`
    import org.jsoup._, collection.JavaConverters._

    val indexDoc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API").get()
    val links = indexDoc.select("h2#Interfaces").nextAll.select("div.index a").asScala
    val linkData = links.map(link => (link.attr("href"), link.attr("title"), link.text))

    val articles = for ((url, tooltip, name) <- linkData) yield {
      println("Scraping " + name)
      val doc = Jsoup.connect("https://developer.mozilla.org" + url).get()
      val summary = doc.select("article#wikiArticle > p").asScala.headOption.fold("")(_.text)
      val methodsAndProperties = doc
        .select("article#wikiArticle dl dt")
        .asScala
        .map(elem => (elem.text, elem.nextElementSibling match {case null => "" case x => x.text}))
      (url, tooltip, name, summary, methodsAndProperties)
    }
  }
Note that I added a bit of error handling in here: rather than fetching the summary text via .text, I use .asScala.headOption.fold("")(_.text) to account for the possibility that there is no summary paragraph. Similarly, I check .nextElementSibling to see if it is null before calling .text to fetch its contents. Other than that, it is essentially the same as the snippets we saw earlier.
This should take a few minutes to run, as it has to fetch every page individually to parse and extract the data we want. After it’s done, we should see the following output:
articles: Buffer[(String, String, String, String, Buffer[(String, String)])] = ArrayBuffer(
  (
    "/en-US/docs/Web/API/ANGLE_instanced_arrays",
    …,
    "ANGLE_instanced_arrays",
    …,
    ArrayBuffer(
      (
        "ext.VERTEX_ATTRIB_ARRAY_DIVISOR_ANGLE",
        "Returns a GLint describing the frequency divisor used for instanced rendering when used in the gl.getVertexAttrib() as the pname parameter."
      ),
      (
        "ext.drawArraysInstancedANGLE()",
        "Behaves identically to gl.drawArrays() except that multiple instances of the range of elements are executed, and the instance advances for each iteration."
      ),
      …
articles contains the first-paragraph summary of every documentation page, along with the documentation for each of its methods and properties. We can see how many articles we have scraped in total:

@ articles.length
res60: Int = 917
As well as how many method and property documentation snippets we have fetched:

@ articles.map(_._5.length).sum
res61: Int = 16583
Lastly, if we need to use this information elsewhere, it is easy to dump to a JSON file that can be accessed from some other process:
@ os.write(os.pwd / "docs.json", upickle.default.write(articles, indent = 4))

@ os.read(os.pwd / "docs.json")
res65: String = """[
    [
        "/en-US/docs/Web/API/ANGLE_instanced_arrays",
        …
        [
            [
                "ext.VERTEX_ATTRIB_ARRAY_DIVISOR_ANGLE",
                "Returns a GLint describing the frequency divisor used for instanced rendering when used in the gl.getVertexAttrib() as the pname parameter."
            ],
            [
                "ext.drawArraysInstancedANGLE()",
                "Behaves identically to gl.drawArrays() …"
            ],
            [
                "ext.drawElementsInstancedANGLE()",
                …
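And to sanity-check the dump from Scala itself, a quick sketch using the ujson library that Ammonite bundles, reading back the docs.json file written above:

@ val loaded = ujson.read(os.read(os.pwd / "docs.json"))

@ loaded.arr.length // one entry per scraped article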
Conclusion
In this tutorial, we have walked through the basics of using the Scala programming language and the Jsoup HTML parser to scrape semi-structured data off of human-readable HTML pages: specifically taking the well-known MDN Web API Documentation, and extracting summary documentation for every interface, method and property documented within it. We did so interactively in the REPL, and were able to see immediately what our code did to extract information from the semi-structured human-readable HTML. For re-usability, you may want to place the code in a Scala Script, or for more sophisticated scrapers use a proper build tool like Mill.
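For example, a minimal sketch of such a script; the file name, entry-point name, and output path here are illustrative, not from the code above:

// scrape.sc — run with: amm scrape.sc
import $ivy.`org.jsoup:jsoup:1.13.1`
import org.jsoup._, collection.JavaConverters._

@main
def run() = {
  val indexDoc = Jsoup.connect("https://developer.mozilla.org/en-US/docs/Web/API").get()
  val links = indexDoc.select("h2#Interfaces").nextAll.select("div.index a").asScala
  val linkData = links.map(link => (link.attr("href"), link.attr("title"), link.text))
  // ...the per-page scraping loop from above would go here...
  os.write.over(os.pwd / "index.json", upickle.default.write(linkData, indent = 4))
}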
This post only covers the basics of web scraping: for websites that need user accounts and logins, you may need to use Requests-Scala to go through the HTTP signup flow and maintain a user session. For websites that require Javascript to function, you may need a more fully-featured browser automation tool like Selenium. Nevertheless, this should be enough to get you started with the basic concepts of fetching, navigating and scraping HTML websites using the Jsoup library.

About the Author: Haoyi is a software engineer, and the author of many open-source Scala tools such as the Ammonite REPL and the Mill Build Tool. If you enjoyed the contents on this blog, you may also enjoy Haoyi's book Hands-on Scala Programming