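b'<html><head/><body><p>Hello<br/>World</p></body></html>'
>>> etree.tostring(root, method='xml') # same as above
b'<html><head/><body><p>Hello<br/>World</p></body></html>'
>>> etree.tostring(root, method='html')
b'<html><head></head><body><p>Hello<br>World</p></body></html>'
>>> print(etree.tostring(root, method='html', pretty_print=True))
<html>
  <head></head>
  <body>
    <p>Hello<br>World</p>
  </body>
</html>
>>> etree.tostring(root, method='text')
b'HelloWorld'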
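The three serialisation methods can be compared side by side. The following is a minimal sketch, not the tutorial's own code: the sample document is an assumption modelled on the output above, and the fallback to the standard library's xml.etree.ElementTree is only there so that the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree

# Hypothetical sample document, modelled on the examples above.
root = etree.fromstring(
    "<html><head/><body><p>Hello<br/>World</p></body></html>")

as_xml = etree.tostring(root, method="xml")    # empty tags stay self-closing
as_html = etree.tostring(root, method="html")  # <br> has no end tag in HTML
as_text = etree.tostring(root, method="text")  # text content only, no markup

print(as_xml)
print(as_html)
print(as_text)
```

All three calls return byte strings, because no encoding was requested explicitly.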
As for XML serialisation, the default encoding for plain text
serialisation is ASCII:
>>> br = root.find('.//br')
>>> br.tail = u'W\xf6rld'
>>> etree.tostring(root, method='text') # doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' ...
>>> etree.tostring(root, method='text', encoding="UTF-8")
b'HelloW\xc3\xb6rld'
Here, serialising to a Python unicode string instead of a byte string
can come in handy. Just pass the unicode type as encoding (on Python 3,
pass str or the name 'unicode' instead):
>>> etree.tostring(root, encoding=unicode, method='text')
u'HelloW\xf6rld'
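On Python 3 the same two calls might look as follows. This is a sketch under stated assumptions: the small sample element is hypothetical, and the standard-library fallback is only for portability. It sidesteps the ASCII default by always naming the encoding.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree

root = etree.fromstring("<p>Hello<br/>World</p>")  # hypothetical sample
root.find("br").tail = "W\xf6rld"                   # non-ASCII tail text

# Explicit byte encoding: the o-umlaut becomes two UTF-8 bytes.
utf8_bytes = etree.tostring(root, method="text", encoding="UTF-8")

# Text string instead of bytes (the Python 3 spelling of encoding=unicode).
text = etree.tostring(root, method="text", encoding="unicode")

print(utf8_bytes)
print(type(text).__name__)
```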
An ElementTree is mainly a document wrapper around a tree with a
root node. It provides a couple of methods for parsing, serialisation
and general document handling. One of the bigger differences is that
it serialises as a complete document, as opposed to a single
Element. This includes top-level processing instructions and
comments, as well as a DOCTYPE and other DTD content in the document:
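>>> tree = etree.parse(StringIO('''\
... <?xml version="1.0"?>
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
... <root>
...   <a>&tasty;</a>
... </root>
... '''))
>>> print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">
>>> # lxml 1.3.4 and later
>>> print(etree.tostring(tree))
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>
>>> # lxml 1.3.4 and later
>>> print(etree.tostring(etree.ElementTree(tree.getroot())))
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
  <a>eggs</a>
</root>
>>> # ElementTree and lxml <= 1.3.3
>>> print(etree.tostring(tree.getroot()))
<root>
  <a>eggs</a>
</root>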
Note that this has changed in lxml 1.3.4 to match the behaviour of
lxml 2.0. Before, the examples were serialised without DTD content,
which made lxml lose DTD information in an input-output cycle.
lxml.etree supports parsing XML in a number of ways and from all
important sources, namely strings, files, URLs (http/ftp) and
file-like objects. The main parse functions are fromstring() and
parse(), both called with the source as first argument. By
default, they use the standard parser, but you can always pass a
different parser as second argument.
The fromstring() function is the easiest way to parse a string:
>>> some_xml_data = "<root>data</root>"
>>> root = etree.fromstring(some_xml_data)
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
The XML() function behaves like the fromstring() function, but is
commonly used to write XML literals right into the source:
>>> root = etree.XML("<root>data</root>")
The parse() function is used to parse from files and file-like objects:
>>> some_file_like = StringIO("<root>data</root>")
>>> tree = etree.parse(some_file_like)
>>> etree.tostring(tree)
b'<root>data</root>'
Note that parse() returns an ElementTree object, not an Element object as
the string parser functions do:
>>> root = tree.getroot()
>>> print(root.tag)
root
The reasoning behind this difference is that parse() returns a
complete document from a file, while the string parsing functions are
commonly used to parse XML fragments.
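The difference in return types can be checked directly. A minimal sketch, assuming a hypothetical one-element document and using io.BytesIO as the file-like source; the standard-library fallback is only for portability.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree

xml_bytes = b"<root>data</root>"  # hypothetical sample input

# String parsers hand back the root Element directly.
root_from_string = etree.fromstring(xml_bytes)

# parse() reads from a file-like object and returns an ElementTree;
# getroot() gives access to the root Element of that document.
tree = etree.parse(io.BytesIO(xml_bytes))
root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.tag, root_from_file.text)
```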
The parse() function supports any of the following sources:
an open file object
a file-like object that has a .read(byte_count) method returning
a byte string on each call
a filename string
an HTTP or FTP URL string
Note that passing a filename or URL is usually faster than passing an
open file.
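Two of the listed sources, a filename and an open file object, can be tried with a temporary file. This is only a sketch: the file content is hypothetical, the standard-library fallback is for portability, and no timing is measured, so it does not demonstrate the speed difference mentioned above.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree

# Write a small document to a temporary file (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<root><a>data</a></root>")
    path = tmp.name

try:
    # Source 1: a filename string.
    tree_from_name = etree.parse(path)

    # Source 2: an open file object (binary mode avoids decoding issues).
    with open(path, "rb") as handle:
        tree_from_file = etree.parse(handle)
finally:
    os.remove(path)

print(tree_from_name.getroot().tag, tree_from_file.getroot()[0].text)
```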
By default, lxml.etree uses a standard parser with a default setup. If
you want to configure the parser, you can create a new instance:
>>> parser = etree.XMLParser(remove_blank_text=True) # lxml.etree only!
This creates a parser that removes empty text between tags while parsing,
which can reduce the size of the tree and avoid dangling tail text if you know
that whitespace-only content is not meaningful for your data. An example:
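>>> root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)
>>> etree.tostring(root)
b'<root><a/><b>  </b></root>'
Note that the whitespace content inside the <b> tag was not removed, as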
content at leaf elements tends to be data content (even if blank). You can
easily remove it in an additional step by traversing the tree:
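>>> for element in root.iter("*"):
...     if element.text is not None and not element.text.strip():
...         element.text = None
>>> etree.tostring(root)
b'<root><a/><b/></root>'
See help(etree.XMLParser) to find out about the available parser options.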
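Where the remove_blank_text option is not available, the traversal step can be extended to cover tail text as well. The helper below is hypothetical (its name and the decision to clear tails are assumptions, not part of lxml), and the standard-library fallback is only for portability.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree


def strip_blank_text(root):
    """Set whitespace-only text and tail values to None (hypothetical helper)."""
    for element in root.iter():
        if element.text is not None and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    return root


root = etree.fromstring("<root>  <a/>   <b>  </b>     </root>")
strip_blank_text(root)
remaining = [(el.tag, el.text, el.tail) for el in root.iter()]
print(remaining)
```

Unlike the parser option, this also blanks leaf elements such as the b element, so it should only be used when whitespace-only content carries no meaning.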
lxml.etree provides two ways for incremental step-by-step parsing. One is
through file-like objects, where it calls the read() method repeatedly.
This is best used where the data arrives from a source like urllib or any
other file-like object that can provide data on request. Note that the parser
will block and wait until data becomes available in this case:
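>>> class DataSource:
...     data = [ b"<roo", b"t><", b"a/", b"><", b"/root>" ]
...     def read(self, requested_size):
...         try:
...             return self.data.pop(0)
...         except IndexError:
...             return b''
>>> tree = etree.parse(DataSource())
>>> etree.tostring(tree)
b'<root><a/></root>'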
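A slightly more defensive variant of such a data source keeps its chunks per instance and counts how often the parser asks for data. The class name, the counter and the chunk boundaries are assumptions for illustration; the standard-library fallback is only for portability.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree


class ChunkedSource:
    """Hypothetical file-like object that hands out one chunk per read()."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self.calls = 0

    def read(self, requested_size=-1):
        # An empty byte string signals the end of the data to the parser.
        self.calls += 1
        return self._chunks.pop(0) if self._chunks else b""


source = ChunkedSource([b"<roo", b"t><", b"a/", b"><", b"/root>"])
tree = etree.parse(source)
root = tree.getroot()
print(root.tag, [child.tag for child in root], source.calls)
```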
The second way is through a feed parser interface, given by the feed(data)
and close() methods:
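>>> parser = etree.XMLParser()
>>> parser.feed("<roo")
>>> parser.feed("t><")
>>> parser.feed("a/")
>>> parser.feed("><")
>>> parser.feed("/root>")
>>> root = parser.close()
>>> etree.tostring(root)
b'<root><a/></root>'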
Here, you can interrupt the parsing process at any time and continue it later
on with another call to the feed() method. This comes in handy if you
want to avoid blocking calls to the parser, e.g. in frameworks like Twisted,
or whenever data comes in slowly or in chunks and you want to do other things
while waiting for the next chunk.
After calling the close() method (or when an exception was raised
by the parser), you can reuse the parser by calling its feed()
method again:
>>> parser.feed("<root/>")
>>> root = parser.close()
>>> etree.tostring(root)
b'<root/>'
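The feed interface fits naturally into a small helper that consumes chunks as they arrive. This is a sketch, not lxml's own code: parse_chunks is a hypothetical name, and it creates a fresh parser per document because parser reuse after close() is described above for lxml and is not assumed for the standard-library fallback.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree


def parse_chunks(chunks):
    """Feed byte chunks to a fresh parser and return the root Element."""
    parser = etree.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)   # hands over data that is already available
    return parser.close()    # finishes parsing and returns the root


root = parse_chunks([b"<roo", b"t><", b"a/", b"><", b"/root>"])
print(root.tag, len(root))
```

In an event loop, each call to feed() would sit in the callback that receives a network chunk, and close() in the callback that signals the end of the stream.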
Sometimes, all you need from a document is a small fraction somewhere deep
inside the tree, so parsing the whole tree into memory, traversing it and
dropping it can be too much overhead. lxml.etree supports this use case
with two event-driven parser interfaces, one that generates parser events
while building the tree (iterparse), and one that does not build the tree
at all, and instead calls feedback methods on a target object in a SAX-like
fashion.
Here is a simple iterparse() example:
>>> some_file_like = StringIO("<root><a>data</a></root>")
>>> for event, element in etree.iterparse(some_file_like):
...     print("%s, %4s, %s" % (event, element.tag, element.text))
end,    a, data
end, root, None
By default, iterparse() only generates events when it is done parsing an
element, but you can control this through the events keyword argument:
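>>> some_file_like = StringIO("<root><a>data</a></root>")
>>> for event, element in etree.iterparse(some_file_like,
...                                       events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))
start, root, None
start,    a, data
  end,    a, data
  end, root, None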
Note that the text, tail and children of an Element are not necessarily there
yet when receiving the start event. Only the end event guarantees
that the Element has been parsed completely.
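One way to respect that guarantee is to read the text only on end events. A minimal sketch with a hypothetical input document; the standard-library fallback is only for portability.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree

source = io.BytesIO(b"<root><a>data</a></root>")  # hypothetical input

seen = []
for event, element in etree.iterparse(source, events=("start", "end")):
    # On "start", element.text may not have been read yet, so it is
    # deliberately ignored here. On "end" the element is complete.
    text = element.text if event == "end" else None
    seen.append((event, element.tag, text))

print(seen)
```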
It also allows you to .clear() or modify the content of an Element to
save memory. So if you parse a large tree and you want to keep memory
usage small, you should clean up parts of the tree that you no longer
need:
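>>> some_file_like = StringIO(
...     "<root><a><b>data</b></a><a><b/></a></root>")
>>> for event, element in etree.iterparse(some_file_like):
...     if element.tag == 'b':
...         print(element.text)
...     elif element.tag == 'a':
...         print("** cleaning up the subtree")
...         element.clear()
data
** cleaning up the subtree
None
** cleaning up the subtree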
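The same pattern can be wrapped in a function that extracts values and discards each element once it has been handled. The function name, the item tag and the input are assumptions for illustration; the standard-library fallback is only for portability.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree


def collect_item_texts(stream):
    """Return the text of every <item>, clearing each one once processed."""
    texts = []
    for _event, element in etree.iterparse(stream):  # "end" events only
        if element.tag == "item":
            texts.append(element.text)
            element.clear()  # drop text, attributes and children
    return texts


# Hypothetical input; a real use case would stream a much larger file.
data = io.BytesIO(b"<items><item>1</item><item>2</item><item>3</item></items>")
texts = collect_item_texts(data)
print(texts)
```

Note that clear() empties an element but leaves it attached to its parent, so very long runs of siblings still accumulate empty elements under the root.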
If memory is a real bottleneck, or if building the tree is not desired at all,
the target parser interface of lxml.etree can be used. It creates
SAX-like events by calling the methods of a target object. By implementing
some or all of these methods, you can control which events are generated:
>>> class ParserTarget:
...     events = []
...     def start(self, tag, attrib):
...         self.events.append(("start", tag, attrib))
...     def close(self):
...         return self.events
>>> parser = etree.XMLParser(target=ParserTarget())
>>> events = etree.fromstring('<root test="true"/>', parser)
>>> for event in events:
...     print('event: %s - tag: %s' % (event[0], event[1]))
...     for attr, value in event[2].items():
...         print(' * %s = %s' % (attr, value))
event: start - tag: root
 * test = true
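A target that also implements end() shows how the choice of methods determines which events are reported. This is a sketch with a hypothetical class name and input; it keeps the event list on the instance rather than on the class, so separate targets do not share state. The standard-library fallback is only for portability.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree


class EventCollector:
    """Hypothetical parser target that records start and end events."""

    def __init__(self):
        self.events = []  # per-instance list, not shared between targets

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        self.events.append(("end", tag))

    def close(self):
        # Whatever close() returns becomes the result of the parse call.
        return self.events


parser = etree.XMLParser(target=EventCollector())
events = etree.fromstring('<root test="true"><a/></root>', parser)
print(events)
```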
The ElementTree API avoids namespace prefixes wherever possible and deploys
the real namespaces instead:
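>>> xhtml = etree.Element("{http://www.w3.org/1999/xhtml}html")
>>> body = etree.SubElement(xhtml, "{http://www.w3.org/1999/xhtml}body")
>>> body.text = "Hello World"
>>> print(etree.tostring(xhtml, pretty_print=True))
<html:html xmlns:html="http://www.w3.org/1999/xhtml">
  <html:body>Hello World</html:body>
</html:html>
Attributes are addressed with their namespace in the same way:
>>> XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
>>> XHTML = "{%s}" % XHTML_NAMESPACE
>>> body.set(XHTML + "bgcolor", "#CCFFAA")
>>> print(body.get("bgcolor"))
None
>>> body.get(XHTML + "bgcolor")
'#CCFFAA'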
You can also use XPath in this way:
>>> find_xhtml_body = etree.ETXPath(      # lxml only!
...     "//{%s}body" % XHTML_NAMESPACE)
>>> results = find_xhtml_body(xhtml)
>>> print(results[0].tag)
{http://www.w3.org/1999/xhtml}body
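The {namespace}name notation works the same way for tags, attributes and path lookups. A minimal sketch that sticks to calls available in both lxml and the standard library (the fallback is an assumption made for portability, and ETXPath is deliberately left out because it is lxml only):

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree

XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
XHTML = "{%s}" % XHTML_NAMESPACE  # "{namespace}" prefix for qualified names

xhtml = etree.Element(XHTML + "html")
body = etree.SubElement(xhtml, XHTML + "body")
body.text = "Hello World"
body.set(XHTML + "bgcolor", "#CCFFAA")

# Lookups need the fully qualified name; a bare name does not match.
unqualified = body.get("bgcolor")
qualified = body.get(XHTML + "bgcolor")
found = xhtml.find(XHTML + "body")

print(unqualified, qualified, found.tag)
```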
The E-factory provides a simple and compact syntax for generating XML and
HTML:
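>>> from lxml.builder import E
>>> def CLASS(*args): # class is a reserved word in Python
...     return {"class": ' '.join(args)}
>>> html = page = (
...   E.html(       # create an Element called "html"
...     E.head(
...       E.title("This is a sample document")
...     ),
...     E.body(
...       E.h1("Hello!", CLASS("title")),
...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
...       E.p("This is another paragraph, with a", "\n      ",
...         E.a("link", href="http://www.python.org"), "."),
...       E.p("Here are some reserved characters: <spam&egg>."),
...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
...     )
...   )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="http://www.python.org">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>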
The lxml.html.builder module provides a ready-made vocabulary for HTML.
The ElementTree library comes with a simple XPath-like path language
called ElementPath. The main difference is that you can use the
{namespace}tag notation in ElementPath expressions. However,
advanced features like value comparison and functions are not
available.
In addition to a full XPath implementation, lxml.etree supports the
ElementPath language in the same way ElementTree does, even using
(almost) the same implementation. The API provides four methods here
that you can find on Elements and ElementTrees:
iterfind() iterates over all Elements that match the path
expression
findall() returns a list of matching Elements
find() efficiently returns only the first match
findtext() returns the content of the first match
Here are some examples:
>>> root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
Find a child of an Element:
>>> print(root.find("b"))
None
>>> print(root.find("a").tag)
a
Find an Element anywhere in the tree:
>>> print(root.find(".//b").tag)
b
>>> [ b.tag for b in root.iterfind(".//b") ]
['b', 'b']
Find Elements with a certain attribute:
>>> print(root.findall(".//a[@x]")[0].tag)
a
>>> print(root.findall(".//a[@y]"))
[]
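All four methods can be exercised on one small tree. A minimal sketch, assuming a hypothetical sample document; the standard-library fallback is only for portability, since ElementPath is supported by both implementations.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback for portability
    import xml.etree.ElementTree as etree

root = etree.fromstring("<root><a x='123'>aText<b/><c/><b/></a></root>")

direct_b = root.find("b")                        # no <b> directly under root
first_a = root.find("a")                         # first matching child
all_b = [b.tag for b in root.iterfind(".//b")]   # lazy iteration, any depth
with_x = root.findall(".//a[@x]")                # list, attribute predicate
a_text = root.findtext("a")                      # text of the first match

print(direct_b, first_a.tag, all_b, len(with_x), a_text)
```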