• November 10, 2024

Html_Fromstring

Parsing HTML - lxml

Author:
Ian Bicking
Since version 2.0, lxml comes with a dedicated Python package for
dealing with HTML: lxml.html. It is based on lxml's HTML parser,
but provides a special Element API for HTML elements, as well as a
number of utilities for common HTML processing tasks.
Contents
Parsing HTML
Parsing HTML fragments
Really broken pages
HTML Element Methods
Running HTML doctests
Creating HTML with the E-factory
Viewing your HTML
Working with links
Functions
Forms
Form Filling Example
Form Submission
Cleaning up HTML
autolink
wordwrap
HTML Diff
Examples
Microformat Example
The main API is based on the lxml.etree API, and thus, on the ElementTree
API.
There are several functions available to parse HTML:
parse(filename_url_or_file):
Parses the named file or URL, or if the object has a .read()
method, parses from that.
If you give a URL, or if the object has a .geturl() method (as
file-like objects from urllib.urlopen() have), then that URL
is used as the base URL. You can also provide an explicit
base_url keyword argument.
document_fromstring(string):
Parses a document from the given string. This always creates a
correct HTML document, which means the parent node is <html>,
and there is a body and possibly a head.
fragment_fromstring(string, create_parent=False):
Returns an HTML fragment from a string. The fragment must contain
just a single element, unless create_parent is given;
e.g., fragment_fromstring(string, create_parent='div') will
wrap the element in a <div>.
fragments_fromstring(string):
Returns a list of the elements found in the fragment.
fromstring(string):
Returns document_fromstring or fragment_fromstring, based
on whether the string looks like a full document, or just a
fragment.
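The parsing functions above can be compared side by side. The following is a minimal sketch, assuming lxml is installed; the sample strings and variable names are made up for illustration.

```python
from io import BytesIO

import lxml.html

# Illustrative inputs (not from the lxml documentation).
page = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
snippet = "<p>one</p><p>two</p>"

# document_fromstring() always builds a full document around the input,
# so the returned root is the <html> element.
doc = lxml.html.document_fromstring(snippet)
print(doc.tag)  # html

# fragment_fromstring() expects a single element unless create_parent
# is given; here the two <p> elements are wrapped in a new <div>.
wrapped = lxml.html.fragment_fromstring(snippet, create_parent="div")
print(wrapped.tag, [child.tag for child in wrapped])  # div ['p', 'p']

# fragments_fromstring() returns the top-level elements as a list.
parts = lxml.html.fragments_fromstring(snippet)
print(len(parts))  # 2

# fromstring() chooses between document and fragment parsing
# depending on what the input looks like.
full = lxml.html.fromstring(page)
single = lxml.html.fromstring("<p>Hello</p>")
print(full.tag, single.tag)  # html p

# parse() accepts a file name, a URL, or an object with a .read() method
# and returns an ElementTree; getroot() gives the <html> element.
tree = lxml.html.parse(BytesIO(page.encode("utf-8")))
root = tree.getroot()
print(root.findtext(".//title"))  # Demo
```

When the shape of the input is known in advance, calling document_fromstring or fragment_fromstring directly makes the result type predictable; fromstring is convenient when the input may be either.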
The normal HTML parser is capable of handling broken HTML, but for
pages that are far enough from HTML to call them ‘tag soup’, it may
still fail to parse the page in a useful way. A way to deal with this
is ElementSoup, which deploys the well-known BeautifulSoup parser to
build an lxml HTML tree.
However, note that the most common problem with web pages is the lack
of (or the existence of incorrect) encoding declarations. It is
therefore often sufficient to only use the encoding detection of
BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml’s
own HTML parser, which is several times faster.
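The encoding-only approach can be sketched as follows. This is an illustrative example, assuming the third-party beautifulsoup4 package is installed alongside lxml; the helper name decode_html and the sample bytes are made up for the example.

```python
from bs4 import UnicodeDammit  # third-party package: beautifulsoup4
import lxml.html


def decode_html(raw_bytes):
    """Decode raw HTML bytes to str using BeautifulSoup's encoding detection.

    Illustrative helper: only the decoding is delegated to BeautifulSoup.
    """
    detected = UnicodeDammit(raw_bytes, is_html=True)
    if detected.unicode_markup is None:
        raise ValueError("could not detect the encoding of the input")
    return detected.unicode_markup


# Sample page: UTF-8 encoded bytes with a matching <meta charset> declaration.
raw = (
    b'<html><head><meta charset="utf-8"></head>'
    b"<body><p>caf\xc3\xa9</p></body></html>"
)

# The decoded text is handed to lxml's own HTML parser for the actual parsing.
root = lxml.html.fromstring(decode_html(raw))
paragraph_text = root.findtext(".//p")
print(paragraph_text)
```

Only the byte-to-text step goes through BeautifulSoup here; the tree itself is still built by lxml's parser.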
HTML elements have all the methods that come with ElementTree, but
also include some extra methods:
.drop_tree():
Drops the element and all its children. Unlike
el.getparent().remove(el) this does not remove the tail
text; with drop_tree the tail text is merged with the previous
element.
.drop_tag():
Drops the tag, but keeps its children and text.
.find_class(class_name):
Returns a list of all the elements with the given CSS class name.
Note that class names are space separated in HTML, so
doc.find_class('highlight') will find an element like