Html_Fromstring

November 16, 2021

Parsing HTML - lxml


Author: Ian Bicking

Since version 2.0, lxml comes with a dedicated Python package for
dealing with HTML: lxml.html. It is based on lxml's HTML parser,
but provides a special Element API for HTML elements, as well as a
number of utilities for common HTML processing tasks.

Contents
Parsing HTML
Parsing HTML fragments
Really broken pages
HTML Element Methods
Running HTML doctests
Creating HTML with the E-factory
Viewing your HTML
Working with links
Functions
Forms
Form Filling Example
Form Submission
Cleaning up HTML
autolink
wordwrap
HTML Diff
Examples
Microformat Example
The main API is based on the lxml.etree API, and thus, on the ElementTree
API.

There are several functions available to parse HTML:

parse(filename_url_or_file):
    Parses the named file or URL, or if the object has a .read()
    method, parses from that.
    If you give a URL, or if the object has a .geturl() method (as
    file-like objects from urllib.request.urlopen() have), then that URL
    is used as the base URL. You can also provide an explicit
    base_url keyword argument.
document_fromstring(string):
    Parses a document from the given string. This always creates a
    correct HTML document, which means the parent node is <html>,
    and there is a body and possibly a head.
fragment_fromstring(string, create_parent=False):
    Returns an HTML fragment from a string. The fragment must contain
    just a single element, unless create_parent is given;
    e.g., fragment_fromstring(string, create_parent='div') will
    wrap the element in a <div>.
fragments_fromstring(string):
    Returns a list of the elements found in the fragment.
fromstring(string):
    Returns document_fromstring or fragment_fromstring, based
    on whether the string looks like a full document, or just a
    fragment.

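The differences between these parsing functions are easiest to see side by side. The following is a minimal sketch, assuming lxml is installed; the HTML snippets and variable names are made up for illustration.

```python
import lxml.html

# document_fromstring always returns the <html> root, even for a snippet.
doc = lxml.html.document_fromstring("<p>Hello</p>")
print(doc.tag)            # html
print(doc.body[0].tag)    # p

# A fragment with a single top-level element comes back as that element.
frag = lxml.html.fragment_fromstring("<p>Hello</p>")
print(frag.tag)           # p

# Several top-level elements need create_parent to wrap them.
wrapped = lxml.html.fragment_fromstring(
    "<p>one</p><p>two</p>", create_parent="div")
print(wrapped.tag, len(wrapped))    # div 2

# fragments_fromstring returns the top-level elements as a list.
parts = lxml.html.fragments_fromstring("<p>one</p><p>two</p>")
print([el.tag for el in parts])     # ['p', 'p']

# fromstring chooses document or fragment parsing based on the input.
full = lxml.html.fromstring("<html><body><p>Hi</p></body></html>")
piece = lxml.html.fromstring("<p>Hi</p>")
print(full.tag, piece.tag)          # html p
```

If the input may or may not be a complete page, fromstring is usually the convenient choice; the more specific functions are useful when the shape of the result has to be predictable.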
The normal HTML parser is capable of handling broken HTML, but for
pages that are far enough from HTML to call them ‘tag soup’, it may
still fail to parse the page in a useful way. A way to deal with this
is ElementSoup, which deploys the well-known BeautifulSoup parser to
build an lxml HTML tree.
However, note that the most common problem with web pages is the lack
of (or the existence of incorrect) encoding declarations. It is
therefore often sufficient to only use the encoding detection of
BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml’s
own HTML parser, which is several times faster.
HTML elements have all the methods that come with ElementTree, but
also include some extra methods:

.drop_tree():
    Drops the element and all its children. Unlike
    el.getparent().remove(el) this does not remove the tail
    text; with drop_tree the tail text is merged with the previous
    element.
.drop_tag():
    Drops the tag, but keeps its children and text.
.find_class(class_name):
    Returns a list of all the elements with the given CSS class name.
    Note that class names are space separated in HTML, so
    doc.find_class('highlight') will find an element like
    <div class="sidebar highlight">. Class names are case
    sensitive.
.find_rel_links(rel):
    Returns a list of all the <a rel="{rel}"> elements. E.g.,
    doc.find_rel_links('tag') returns all the links marked as
    tags.
.get_element_by_id(id, default=None):
    Returns the element with the given id, or the default if
    none is found. If there are multiple elements with the same id
    (which there shouldn't be, but there often are), this returns only
    the first.
.text_content():
    Returns the text content of the element, including the text
    content of its children, with no markup.
.cssselect(expr):
    Selects elements from this element and its children, using a CSS
    selector expression. (Note that .xpath(expr) is also
    available on all lxml elements.)
.label:
    Returns the corresponding <label> element for this element, if
    any exists (None if there is none). Label elements have a
    label.for_element attribute that points back to the element.
.base_url:
    The base URL for this element, if one was saved from the parsing.
    This attribute is not settable. It is None when no base URL was
    saved.
.classes:
    Returns a set-like object that allows accessing and modifying the
    names in the 'class' attribute of the element. (New in lxml 3.5.)
.set(key, value=None):
    Sets an HTML attribute. If no value is given, or if the value is
    None, it creates a boolean attribute like
    <form novalidate></form>. In XML, attributes must
    have at least the empty string as their value like
    <form novalidate=""></form>, but HTML boolean attributes can also be
    just present or absent from an element without having a value.

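A few of these methods are easier to follow on a concrete element. The sketch below assumes lxml is installed and uses a made-up snippet; the ids and class names are illustrative only.

```python
import lxml.html

doc = lxml.html.fromstring(
    '<div id="main">'
    '<p class="sidebar highlight">Hello <b>bold</b> world</p>'
    '<span class="note">remove me</span> tail text'
    '</div>'
)

# find_class matches one class name among several space-separated ones.
para = doc.find_class("highlight")[0]
print(para.text_content())          # Hello bold world

# drop_tag removes the <b> tag itself but keeps its text in place.
para.find("b").drop_tag()
print(para.find("b"))               # None

# drop_tree removes the <span> and its content; the tail text after it
# is kept and merged with the previous element.
doc.find_class("note")[0].drop_tree()
print(doc.text_content())           # Hello bold world tail text

# get_element_by_id returns the given default when nothing matches.
print(doc.get_element_by_id("missing", None))   # None

# classes is a set-like view on the class attribute (lxml 3.5 and later).
para.classes.add("seen")
print("seen" in para.classes)       # True
```

Note how drop_tree differs from removing the element through its parent: the text that followed the removed span is still part of the document afterwards.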
One of the interesting modules in the lxml.html package deals with
doctests. It can be hard to compare two HTML pages for equality, as
whitespace differences aren't meaningful and the structural formatting
can differ. This is even more of a problem in doctests, where output is
tested for equality and small differences in whitespace or the order
of attributes can make a test fail. And given the verbosity of
tag-based languages, it may take more than a quick look to find the
actual differences in the doctest output.

Luckily, lxml provides the lxml.doctestcompare module that
supports relaxed comparison of XML and HTML pages and provides a
readable diff in the output when a test fails. The HTML comparison is
most easily used by importing the usedoctest module in a doctest:

>>> import lxml.html.usedoctest

Now, if you have an HTML document and want to compare it to an expected result
document in a doctest, you can do the following:

>>> import lxml.html
>>> html = lxml.html.fromstring('''\
...    <html><body>
...      <p>Hi!</p>
...    </body></html>
... ''')
>>> print(lxml.html.tostring(html, encoding='unicode'))
<html>
  <body>
    <p>Hi!</p>
  </body>
</html>

In documentation, you would likely prefer the pretty printed HTML output, as
it is the most readable. However, differently formatted versions of the same
document are equivalent from the point of view of an HTML tool, so the doctest
will silently accept any of them. This allows you to concentrate on
readability in your doctests, even if the real output is a straight ugly HTML
one-liner.

Note that there is also an lxml.usedoctest module which you can
import for XML comparisons. The HTML parser notably ignores
namespaces and some other XMLisms.

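lxml.html comes with a predefined HTML vocabulary for the E-factory,
originally written by Fredrik Lundh. This allows you to quickly generate HTML
pages and fragments:

>>> import lxml.html
>>> from lxml.html import builder as E
>>> from lxml.html import usedoctest
>>> html = E.HTML(
...   E.HEAD(
...     E.LINK(rel="stylesheet", href="", type="text/css"),
...     E.TITLE("Best Page Ever")
...   ),
...   E.BODY(
...     E.H1(E.CLASS("heading"), "Top News"),
...     E.P("World News only on this page", style="font-size: 200%"),
...     "Ah, and here's some more text, by the way.",
...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
...   )
... )
>>> print(lxml.html.tostring(html, encoding='unicode'))
<html>
  <head>
    <link rel="stylesheet" href="" type="text/css">
    <title>Best Page Ever</title>
  </head>
  <body>
    <h1 class="heading">Top News</h1>
    <p style="font-size: 200%">World News only on this page</p>
    Ah, and here's some more text, by the way.
    <p>... and this is a parsed fragment ...</p>
  </body>
</html>
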
Note that you should use lxml.html.tostring and not
lxml.etree.tostring. lxml.etree.tostring(doc) will return the XML
representation of the document, which is not valid HTML. In
particular, things like <script src="..."></script> will be
serialized as <script src="..." />, which completely confuses browsers.

>>> html = '''\
... ...
... ... a link ... another link ...
... a paragraph
... ...
... annoying EVIL! ... spam spam SPAM! ...
... '''

To remove all the suspicious content from this unparsed document, use the
clean_html function:

>>> from lxml.html.clean import clean_html
>>> print(clean_html(html))
a link
another link
a paragraph
secret EVIL!
of EVIL!
Password:
annoying EVIL! spam spam SPAM!

The Cleaner class supports several keyword arguments to control exactly
which content is removed:

>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(page_structure=False, links=False)
>>> print(cleaner.clean_html(html))
annoying EVIL!
spam spam SPAM!

>>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
...                   page_structure=False, safe_attrs_only=False)
>>> print(cleaner.clean_html(html))
spam spam SPAM!

You can also whitelist some otherwise dangerous content with
Cleaner(host_whitelist=['']), which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.
See the docstring of Cleaner for the details of what can be
cleaned.
In addition to cleaning up malicious HTML, lxml.html.clean
contains functions to do other things to your HTML. This includes
autolinking:

autolink(doc, ...)
autolink_html(html, ...)

This finds anything in the text of an HTML document that looks like a
link and turns it into an anchor. It avoids making bad links.

Links are not made inside the elements <textarea>, <pre>, <code>, or anything in the head of the document. You can pass in a list of elements to avoid in avoid_elements=['textarea', ...].

Links to some hosts can be avoided. By default links to localhost*, example.* and 127.0.0.1 are not autolinked. Pass in avoid_hosts=[list_of_regexes] to control this.

Elements with the nolink CSS class are not autolinked. Pass in avoid_classes=['code', ...] to control this.

The autolink_html() version of the function parses the HTML string first, and returns a string.

You can also wrap long words in your HTML:

word_break(doc, max_width=40, ...)
word_break_html(html, ...)

This finds any long words in the text of the document and inserts &#8203; in the document (which is the Unicode zero-width space).

This avoids the elements <pre>, <textarea>, and <code>. You can control this with avoid_elements=['textarea', ...].

It also avoids elements with the CSS class nobreak. You can control this with avoid_classes=['code', ...].

Lastly you can control the character that is inserted with break_character='\u200b'. However, you cannot insert markup, only text.

word_break_html(html) parses the HTML document and returns a string.

The module lxml.html.diff offers some ways to visualize differences in HTML documents. These differences are content oriented. That is, changes in markup are largely ignored; only changes in the content itself are highlighted.

There are two ways to view differences: htmldiff and html_annotate. One shows differences with <ins> and <del>, while the other annotates a set of changes similar to svn blame. Both these functions operate on text, and work best with content fragments (only what goes in <body>), not complete documents.

Example of htmldiff:

>>> from lxml.html.diff import htmldiff, html_annotate
>>> doc1 = ''' Here is some text. '''
>>> doc2 = ''' Here is a lot of text. '''
>>> doc3 = ''' Here is a little text. '''
>>> print(htmldiff(doc1, doc2))
Here is <ins>a lot of text. </ins> <del>some text. </del>
>>> print(html_annotate([(doc1, 'author1'), (doc2, 'author2'),
...                      (doc3, 'author3')]))
Here is a little text .

As you can see, it is imperfect as such things tend to be. On larger tracts of text with larger edits it will generally do better.

The html_annotate function can also take an optional second argument, markup. This is a function like markup(text, version) that returns the given text marked up with the given version. The default version, the output of which you see in the example, looks like:

import html

def default_markup(text, version):
    return '<span title="%s">%s</span>' % (
        html.escape(str(version), True), text)

This example parses the hCard microformat.

First we get the page:

>>> from urllib.request import urlopen
>>> from lxml.html import fromstring
>>> url = ''
>>> content = urlopen(url).read()
>>> doc = fromstring(content)
>>> doc.make_links_absolute(url)

Then we create some objects to put the information in:

>>> class Card(object):
...     def __init__(self, **kw):
...         for name, value in kw.items():
...             setattr(self, name, value)
>>> class Phone(object):
...     def __init__(self, phone, types=()):
...         self.phone, self.types = phone, types

And some generally handy functions for microformats:

>>> def get_text(el, class_name):
...     els = el.find_class(class_name)
...     if els:
...         return els[0].text_content()
...     else:
...         return ''
>>> def get_value(el):
...     return get_text(el, 'value') or el.text_content()
>>> def get_all_texts(el, class_name):
...     return [e.text_content() for e in el.find_class(class_name)]
>>> def parse_addresses(el):
...     # Ideally this would parse street, etc.
...     return el.find_class('adr')

Then the parsing:

>>> for el in doc.find_class('hcard'):
...     card = Card()
...     card.el = el
...     card.fn = get_text(el, 'fn')
...     card.tels = []
...     for tel_el in el.find_class('tel'):
...         card.tels.append(Phone(get_value(tel_el),
...                                get_all_texts(tel_el, 'type')))
...     card.addresses = parse_addresses(el)

<h2>HTML Scraping - The Hitchhiker's Guide to Python</h2>

Web Scraping

Web sites are written using HTML, which means that each web page is a structured document. Sometimes it would be great to obtain some data from them and preserve the structure while we're at it. Web sites don't always provide their data in comfortable formats such as CSV or JSON.

This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and gather the data that you need in a format most useful to you while at the same time preserving the structure of the data.

lxml and Requests

lxml is a pretty extensive library written for parsing XML and HTML documents very quickly, even handling messed up tags in the process. We will also be using the Requests module instead of the already built-in urllib2 module due to improvements in speed and readability. You can easily install both using pip install lxml and pip install requests.

Let's start with the imports:

    from lxml import html
    import requests

Next we will use requests.get to retrieve the web page with our data, parse it using the html module, and save the results in tree:

    page = requests.get('')
    tree = html.fromstring(page.content)

(We use page.content rather than page.text so that html.fromstring receives the raw bytes and the parser can work out the encoding itself.)

tree now contains the whole HTML file in a nice tree structure which we can go over in two different ways: XPath and CSSSelect. In this example, we will focus on the former.

XPath is a way of locating information in structured documents such as HTML or XML documents. A good introduction to XPath is on W3Schools.

There are also various tools for obtaining the XPath of elements such as FireBug for Firefox or the Chrome Inspector.
If you're using Chrome, you can right click an element, choose 'Inspect element', highlight the code, right click again, and choose 'Copy XPath'.

After a quick analysis, we see that in our page the data is contained in two elements: one is a div with title 'buyer-name' and the other is a span with class 'item-price':

    <div title="buyer-name">Carson Busses</div>
    <span class="item-price">$29.95</span>

Knowing this we can create the correct XPath query and use the lxml xpath function like this:

    # This will create a list of buyers:
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    # This will create a list of prices:
    prices = tree.xpath('//span[@class="item-price"]/text()')

Let's see what we got exactly:

    print('Buyers: ', buyers)
    print('Prices: ', prices)

    Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
    Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']

Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and Requests. We have it stored in memory as two lists. Now we can do all sorts of cool stuff with it: we can analyze it using Python or we can save it to a file and share it with the world.

Some more cool ideas to think about are modifying this script to iterate through the rest of the pages of this example dataset, or rewriting this application to use threads for improved speed.

<h2>Web Scraping with lxml: What you need to know</h2>

Why should you bother learning how to web scrape?
If your job doesn't require you to learn it, then let me give you some motivation. What if you want to create a website which curates the cheapest products from Amazon, Walmart and a couple of other online stores? A lot of these online stores don't provide you with an easy way to access their information using an API. In the absence of an API, your only choice is to create a web scraper which can extract information from these websites automatically and provide you with that information in an easy to use way.

This is a guest post brought to you by your friends @ Timber.

A typical API response comes back as JSON, as Reddit's API does.

There are a lot of Python libraries out there which can help you with web scraping. There is lxml, BeautifulSoup and a full-fledged framework called Scrapy. Most of the tutorials discuss BeautifulSoup and Scrapy, so I decided to go with lxml in this post. I will teach you the basics of XPaths and how you can use them to extract data from an HTML document. I will take you through a couple of different examples so that you can quickly get up to speed with lxml.

If you are a gamer, you will already know of (and likely love) this website. We will be trying to extract data from Steam. More specifically, we will be selecting from the "popular new releases" information. I am converting this into a two-part series. In this part, we will be creating a Python script which can extract the names of the games, the prices of the games, the different tags associated with each game and the target platforms. In the second part, we will turn this script into a Flask based API and then host it.

Step 1: Exploring Steam

First of all, open up the "popular new releases" page on Steam and scroll down until you see the Popular New Releases tab. At this point, I usually open up Chrome developer tools and see which HTML tags contain the required data.
I extensively use the element inspector tool (the button in the top left of the developer tools). It allows you to see the HTML markup behind a specific element on the page with just one click. As a high-level overview, everything on a web page is encapsulated in an HTML tag and tags are usually nested. You need to figure out which tags you need to extract the data from and you are good to go. In our case, if we take a look, we can see that every separate list item is encapsulated in an anchor (a) tag.

The anchor tags themselves are encapsulated in the div with an id of tab_newreleases_content. I am mentioning the id because there are two tabs on this page. The second tab is the standard "New Releases" tab, and we don't want to extract information from that tab. Hence, we will first extract the "Popular New Releases" tab, and then we will extract the required information from this tab.

Step 2: Start writing a Python script

This is a perfect time to create a new Python file and start writing down our script. I am going to create a file. Now let's go ahead and import the required libraries. The first one is the requests library and the second one is the lxml.html library.

    import requests
    import lxml.html

If you don't have requests installed, you can easily install it by running this command in the terminal:

    $ pip install requests

The requests library is going to help us open the web page in Python. We could have used lxml to open the HTML page as well but it doesn't work well with all web pages, so to be on the safe side I am going to use requests.

Now let's open up the web page using requests and pass that response to lxml.html.fromstring:

    html = requests.get('')
    doc = lxml.html.fromstring(html.content)

This provides us with an object of HtmlElement type. This object has the xpath method which we can use to query the HTML document. This provides us with a structured way to extract information from an HTML document.

Step 3: Fire up the Python Interpreter

Now save this file and open up a terminal.
Copy the code from the file and paste it in a Python interpreter session. We are doing this so that we can quickly test our XPaths without continuously editing, saving and executing our script.

Let's try writing an XPath for extracting the div which contains the 'Popular New Releases' tab. I will explain the code as we go along:

    new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

This statement will return a list of all the divs in the HTML page which have an id of tab_newreleases_content. Now because we know that only one div on the page has this id we can take out the first element from the list ([0]) and that would be our required div. Let's break down the XPath and try to understand it:

// these double forward slashes tell lxml that we want to search for all tags in the HTML document which match our requirements/filters. Another option was to use / (a single forward slash). The single forward slash returns only the immediate child tags/nodes which match our requirements/filters.

div tells lxml that we are searching for divs in the HTML page.

[@id="tab_newreleases_content"] tells lxml that we are only interested in those divs which have an id of tab_newreleases_content.

Cool! We have got the required div. Now let's go back to Chrome and check which tag contains the titles of the releases.

Step 4: Extract the titles & prices

The title is contained in a div with a class of tab_item_name. Now that we have the "Popular New Releases" tab extracted we can run further XPath queries on that tab. Write down the following code in the same Python console which we previously ran our code in:

    titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')

This gives us the titles of all of the games in the "Popular New Releases" tab.

Let's break down this XPath a little bit because it is a bit different from the last one.

. tells lxml that we are only interested in the tags which are the children of the new_releases tag.

[@class="tab_item_name"] is pretty similar to how we were filtering divs based on id. The only difference is that here we are filtering based on the class name.

/text() tells lxml that we want the text contained within the tag we just extracted. In this case, it returns the title contained in the div with the tab_item_name class name.

Now we need to extract the prices for the games. We can easily do that by running the following code:

    prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()')

I don't think I need to explain this code as it is pretty similar to the title extraction code. The only change we made is the change in the class name.

Step 5: Extracting tags

Now we need to extract the tags associated with the titles. Write down the following code in the Python terminal to extract the tags:

    tags = new_releases.xpath('.//div[@class="tab_item_top_tags"]')
    total_tags = []
    for tag in tags:
        total_tags.append(tag.text_content())

So what we are doing here is that we are extracting the divs containing the tags for the games. Then we loop over the list of extracted tags and then extract the text from those tags using the text_content() method. text_content() returns the text contained within an HTML tag without the HTML markup.

Note: We could have also made use of a list comprehension to make that code shorter. I wrote it down in this way so that even those who don't know about list comprehensions can understand the code. Either way, this is the alternate code:

    tags = [tag.text_content() for tag in new_releases.xpath('.//div[@class="tab_item_top_tags"]')]

Let's separate the tags in a list as well so that each tag is a separate element:

    tags = [tag.split(', ') for tag in tags]

Step 6: Extracting the platforms

Now the only thing remaining is to extract the platforms associated with each title. The major difference here is that the platforms are not contained as texts within a specific tag.
They are listed as the class name. Some titles only have one platform associated with them, like this:

    <span class="platform_img win"></span>

Other titles have as many as 5 platforms associated with them. As we can see, these spans contain the platform type as the class name. The only common thing between these spans is that all of them contain the platform_img class. First of all, we will extract the divs with the tab_item_details class, then we will extract the spans containing the platform_img class and finally we will extract the second class name from those spans. Here is the code:

    platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]')
    total_platforms = []

    for game in platforms_div:
        temp = game.xpath('.//span[contains(@class, "platform_img")]')
        platforms = [t.get('class').split(' ')[-1] for t in temp]
        if 'hmd_separator' in platforms:
            platforms.remove('hmd_separator')
        total_platforms.append(platforms)

In line 1 we start with extracting the tab_item_details div. The XPath in line 5 is a bit different. Here we have [contains(@class, "platform_img")] instead of simply having [@class="platform_img"]. The reason is that [@class="platform_img"] returns those spans which only have the platform_img class associated with them. If the spans have an additional class, they won't be returned. Whereas [contains(@class, "platform_img")] filters all the spans which have the platform_img class. It doesn't matter whether it is the only class or if there are more classes associated with that tag.

In line 6 we are making use of a list comprehension to reduce the code size. The .get() method allows us to extract an attribute of a tag. Here we are using it to extract the class attribute of a span. We get a string back from the .get() method. In case of the first game, the string being returned is platform_img win, so we split that string on the whitespace, and then we store the last part (which is the actual platform name) of the split string in the list.

In lines 7-8 we are removing the hmd_separator from the list if it exists.
This is because hmd_separator is not a platform. It is just a vertical separator bar used to separate actual platforms from VR/AR hardware.

Step 7: Conclusion

Now we just need the script to return a JSON response so that we can easily turn this into a Flask based API. Here is the code:

    output = []
    for info in zip(titles, prices, tags, total_platforms):
        resp = {}
        resp['title'] = info[0]
        resp['price'] = info[1]
        resp['tags'] = info[2]
        resp['platforms'] = info[3]
        output.append(resp)

This code is self-explanatory. We are using the zip function to loop over all of those lists in parallel. Then we create a dictionary for each game and assign the title, price, tags, and platforms as a separate key in that dictionary. Lastly, we append that dictionary to the output list.

In a future post, we will take a look at how we can convert this into a Flask based API and host it.

This post was written by Yasoob from Python Tips. I hope you guys enjoyed this tutorial. If you want to read more tutorials of a similar nature, please go to Python Tips. I regularly write Python tips, tricks, and tutorials on that blog. And if you are interested in learning intermediate Python, then please check out my open source book. Have a great day!

<h2>Frequently Asked Questions about html_fromstring</h2>
<h3>What is HTML Fromstring Python?</h3>
Parse the html, returning a single element/document. This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.
<h3>What does HTML Fromstring do?</h3>
fromstring provides us with an object of HtmlElement type. This object has the xpath method which we can use to query the HTML document. This provides us with a structured way to extract information from an HTML document.
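As a small illustration of that answer, here is a minimal sketch, assuming lxml is installed. The inline page_content bytes are a made-up stand-in for a downloaded page, reusing the buyer-name and item-price markup from the scraping tutorial above.

```python
from lxml import html

# Stand-in for a downloaded page; in practice this would be the raw
# bytes of an HTTP response (for example the .content of a requests call).
page_content = b"""
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
  <div title="buyer-name">Earl E. Byrd</div>
  <span class="item-price">$8.37</span>
</body></html>
"""

# fromstring returns the root of the parsed tree as an HtmlElement.
tree = html.fromstring(page_content)
print(type(tree).__name__)      # HtmlElement

# The xpath method queries the tree; /text() selects the text nodes.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')
print(list(zip(buyers, prices)))
```

The two result lists line up here only because every buyer has exactly one price in the sample markup; on a real page it is worth checking that both lists have the same length before pairing them.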
<h3>What is lxml HTML?</h3> lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML). <div class="post-tags"> <a href="#"></a> </div> <div class="post-navigation"> <div class="post-prev"> <a href="https://proxyboys.net/port-format/"> <div class="postnav-image"> <div class="overlay"></div> <div class="navprev"> <img width="225" height="225" src="https://proxyboys.net/wp-content/uploads/2021/11/images-251.png" class="attachment-post-thumbnail size-post-thumbnail wp-post-image" alt="" decoding="async" fetchpriority="high" srcset="https://proxyboys.net/wp-content/uploads/2021/11/images-251.png 225w, https://proxyboys.net/wp-content/uploads/2021/11/images-251-150x150.png 150w" sizes="(max-width: 225px) 100vw, 225px" /> </div> </div> <div class="prev-post-title"> <a href="https://proxyboys.net/port-format/" rel="prev">Port Format</a> </div> </a> </div> <div class="post-next"> <a href="https://proxyboys.net/playing-maplestory/"> <div class="postnav-image"> <div class="overlay"></div> <div class="navnext"> </div> </div> <div class="next-post-title"> <a href="https://proxyboys.net/playing-maplestory/" rel="next">Playing Maplestory</a> </div> </a> </div> </div> </div> </div> <div id="comments" class="comments-area"> <div id="respond" class="comment-respond"> <h3 id="reply-title" class="comment-reply-title">Leave a Reply <a rel="nofollow" id="cancel-comment-reply-link" href="/html_fromstring/#respond" style="display:none;">Cancel reply</a></h3><form action="https://proxyboys.net/wp-comments-post.php" method="post" id="commentform" class="comment-form" novalidate>Your email address will not be published. Required fields are marked *<label for="comment">Comment *</label> <textarea id="comment" name="comment" cols="45" rows="8" maxlength="65525" required>
