• November 16, 2022

Lxml Xpath Python

XPath and XSLT with lxml

lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through
libxml2 and libxslt in a standards compliant way.
lxml.etree supports the simple path syntax of the find, findall and
findtext methods on ElementTree and Element, as known from the original
ElementTree library (ElementPath). As an lxml specific extension, these
classes also provide an xpath() method that supports expressions in the
complete XPath syntax, as well as custom extension functions.
There are also specialized XPath evaluator classes that are more efficient for
frequent evaluation: XPath and XPathEvaluator. See the performance
comparison to learn when to use which. Their semantics when used on
Elements and ElementTrees are the same as for the xpath() method described
here.
Note that the .find*() methods are usually faster than the full-blown XPath
support. They also support incremental tree processing through the .iterfind()
method, whereas XPath always collects all results before returning them.
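The difference between incremental and collected results can be sketched as follows. This is a minimal illustration, not part of the original documentation; the sample document and the variable names are made up for the example.

```python
from lxml import etree

# Made-up sample document for illustration.
root = etree.XML("<root><item>1</item><item>2</item><item>3</item></root>")

# iterfind() yields matches one at a time, so the loop can stop early
# without visiting the remaining matches.
first = None
for item in root.iterfind("item"):
    first = item
    break

# xpath() builds the complete result list before returning it.
all_items = root.xpath("item")

print(first.text, len(all_items))
```

For small documents the difference is negligible; it matters mostly when only the first few matches of a large tree are needed.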
The xpath() method
For ElementTree, the xpath method performs a global XPath query against the
document (if absolute) or against the root node (if relative):
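>>> from io import StringIO
>>> from lxml import etree
>>> f = StringIO('<foo><bar></bar></foo>')
>>> tree = etree.parse(f)
>>> r = tree.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
>>> r = tree.xpath('bar')
>>> r[0].tag
'bar'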
When xpath() is used on an Element, the XPath expression is evaluated
against the element (if relative) or against the root tree (if absolute):
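>>> root = tree.getroot()
>>> bar = root[0]
>>> r = bar.xpath('/foo/bar')  # absolute path: evaluated against the root tree
>>> r[0].tag
'bar'
>>> tree = bar.getroottree()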
The xpath() method has support for XPath variables:
>>> expr = "//*[local-name() = $name]"
>>> print(root.xpath(expr, name = "foo")[0].tag)
foo
>>> print(root.xpath(expr, name = "bar")[0].tag)
bar
>>> print(root.xpath("$text", text = "Hello World!"))
Hello World!
Namespaces and prefixes
If your XPath expression uses namespace prefixes, you must define them
in a prefix mapping. To this end, pass a dictionary to the
namespaces keyword argument that maps the namespace prefixes used
in the XPath expression to namespace URIs:
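>>> f = StringIO('''\
... <a:foo xmlns:a="http://codespeak.net/ns/test1"
...        xmlns:b="http://codespeak.net/ns/test2"><b:bar>Text</b:bar></a:foo>
... ''')
>>> doc = etree.parse(f)
>>> r = doc.xpath('/x:foo/b:bar',
...               namespaces={'x': 'http://codespeak.net/ns/test1',
...                           'b': 'http://codespeak.net/ns/test2'})
>>> r[0].tag
'{http://codespeak.net/ns/test2}bar'
>>> r[0].text
'Text'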
The prefixes you choose here are not linked to the prefixes used
inside the XML document. The document may define whatever prefixes it
likes, including the empty prefix, without breaking the above code.
Note that XPath does not have a notion of a default namespace. The
empty prefix is therefore undefined for XPath and cannot be used in
namespace prefix mappings.
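A short sketch of what this means in practice, assuming a document that declares a default namespace. The namespace URI urn:example:feed and the prefix f are invented for the illustration; any prefix works as long as it is mapped to the right URI.

```python
from lxml import etree

# The document uses a default namespace (no prefix). The URI is made up.
root = etree.XML('<feed xmlns="urn:example:feed"><entry>first</entry></feed>')

# An unprefixed name in XPath means "no namespace", so nothing matches.
unprefixed = root.xpath("//entry")

# Choosing an arbitrary prefix and mapping it to the URI does match.
prefixed = root.xpath("//f:entry", namespaces={"f": "urn:example:feed"})

print(len(unprefixed), prefixed[0].text)
```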
There is also an optional extensions argument which is used to
define custom extension functions in Python that are local to this
evaluation. The namespace prefixes that they use in the XPath
expression must also be defined in the namespace prefix mapping.
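As a rough sketch of the extensions argument: the function shout, the prefix d and the namespace URI below are hypothetical names chosen for this example. The sketch assumes the dictionary form of the argument, which maps (namespace URI, function name) pairs to Python callables.

```python
from lxml import etree

def shout(context, text):
    # The first argument is the evaluation context passed in by lxml;
    # the remaining arguments are the XPath function arguments.
    return text.upper()

ns = "urn:example:demo-functions"  # hypothetical namespace URI

root = etree.XML("<root><a>hello</a></root>")
result = root.xpath(
    "d:shout(string(/root/a))",
    namespaces={"d": ns},               # prefix used in the expression
    extensions={(ns, "shout"): shout},  # function local to this evaluation
)

print(result)
```

Wrapping the argument in string() hands the function a plain string rather than a node list, which keeps the function body simple.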
XPath return values
The return value types of XPath evaluations vary, depending on the
XPath expression used:
  • True or False, when the XPath expression has a boolean result
  • a float, when the XPath expression has a numeric result (integer or float)
  • a ‘smart’ string (as described below), when the XPath expression has
    a string result
  • a list of items, when the XPath expression has a list as result.
    The items may include Elements (also comments and processing
    instructions), strings and tuples. Text nodes and attributes in the
    result are returned as ‘smart’ string values. Namespace
    declarations are returned as tuples of strings: (prefix, URI).
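The result types listed above can be observed with a few one-line queries. This is an illustrative sketch; the sample document and variable names are made up.

```python
from lxml import etree

# Made-up sample document for illustration.
root = etree.XML('<root><a id="x">one</a><a>two</a></root>')

is_present = root.xpath("boolean(//a)")  # boolean result
how_many = root.xpath("count(//a)")      # numeric result, returned as float
first_text = root.xpath("string(//a)")   # string result
elements = root.xpath("//a")             # list of Elements
attributes = root.xpath("//a/@id")       # list of 'smart' strings

print(is_present, how_many, first_text, len(elements), attributes)
```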
XPath string results are ‘smart’ in that they provide a
getparent() method that knows their origin:
  • for attribute values, result.getparent() returns the Element
    that carries them. An example is //foo/@attribute, where the
    parent would be a foo Element.
  • for the text() function (as in //text()), it returns the
    Element that contains the text or tail that was returned.
You can distinguish between different text origins with the boolean
properties is_text, is_tail and is_attribute.
Note that getparent() may not always return an Element. For
example, the XPath functions string() and concat() will
construct strings that do not have an origin. For them,
getparent() will return None.
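The three origins and the "no origin" case can be told apart as in the following sketch. The sample document and variable names are assumptions made for the illustration.

```python
from lxml import etree

# Made-up document: <a> has text, an attribute, and tail text after it.
root = etree.XML('<root><a href="x">text</a>tail</root>')

text, tail = root.xpath("//text()")  # document order: "text", then "tail"
attr = root.xpath("//a/@href")[0]
built = root.xpath("string(//a)")    # constructed string, has no origin

print(text.is_text, text.getparent().tag)
print(tail.is_tail, tail.getparent().tag)
print(attr.is_attribute, attr.getparent().tag)
print(built.getparent())
```

Note that for tail text, getparent() returns the element the tail belongs to (here a), not the enclosing root element.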
There are certain cases where the smart string behaviour is
undesirable. For example, it means that the tree will be kept alive
by the string, which may have a considerable memory impact in the case
that the string value is the only thing in the tree that is actually
of interest. For these cases, you can deactivate the parental
relationship using the keyword argument smart_strings.
>>> root = etree.XML("<root><a>TEXT</a></root>")
>>> find_text = etree.XPath("//text()")
>>> text = find_text(root)[0]
>>> print(text)
TEXT
>>> print(text.getparent().text)
TEXT
>>> find_text = etree.XPath("//text()", smart_strings=False)
>>> text = find_text(root)[0]
>>> hasattr(text, 'getparent')
False
Generating XPath expressions
ElementTree objects have a method getpath(element), which returns a
structural, absolute XPath expression to find that element:
>>> a = etree.Element("a")
>>> b = etree.SubElement(a, "b")
>>> c = etree.SubElement(a, "c")
>>> d1 = etree.SubElement(c, "d")
>>> d2 = etree.SubElement(c, "d")
>>> tree = etree.ElementTree(c)
>>> print(tree.getpath(d2))
/c/d[2]
>>> tree.xpath(tree.getpath(d2)) == [d2]
True
The XPath class
The XPath class compiles an XPath expression into a callable function:
>>> root = etree.XML("<root><a><b/></a><b/></root>")
>>> find = etree.XPath("//b")
>>> print(find(root)[0].tag)
b
The compilation takes as much time as in the xpath() method, but it is
done only once per class instantiation. This makes it especially efficient
for repeated evaluation of the same XPath expression.
Just like the xpath() method, the XPath class supports XPath
variables:
>>> count_elements = etree.XPath("count(//*[local-name() = $name])")
>>> print(count_elements(root, name = "a"))
1.0
>>> print(count_elements(root, name = "b"))
2.0
This supports very efficient evaluation of modified versions of an XPath
expression, as compilation is still only required once.
Prefix-to-namespace mappings can be passed as second parameter:
>>> root = etree.XML("<root xmlns='NS'><a><b/></a><b/></root>")
>>> find = etree.XPath("//n:b", namespaces={'n':'NS'})
>>> print(find(root)[0].tag)
{NS}b
Regular expressions in XPath
By default, XPath supports regular expressions in the EXSLT namespace:
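>>> regexpNS = "http://exslt.org/regular-expressions"
>>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]",
...                    namespaces={'re':regexpNS})
>>> root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
>>> print(find(root)[0].text)
aBc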
You can disable this with the boolean keyword argument regexp which
defaults to True.
The XPathEvaluator classes
lxml.etree provides two other efficient XPath evaluators that work on
ElementTrees or Elements respectively: XPathDocumentEvaluator and
XPathElementEvaluator. They are automatically selected if you use the
XPathEvaluator helper for instantiation:
>>> xpatheval = etree.XPathEvaluator(root)
>>> print(isinstance(xpatheval, etree.XPathElementEvaluator))
True
>>> print(xpatheval("//b")[0].tag)
b
This class provides efficient support for evaluating different XPath
expressions on the same Element or ElementTree.
ETXPath
ElementTree supports a language named ElementPath in its find*() methods.
One of the main differences between XPath and ElementPath is that the XPath
language requires an indirection through prefixes for namespace support,
whereas ElementTree uses the Clark notation ({ns}name) to avoid prefixes
completely. The other major difference regards the capabilities of both path
languages. Where XPath supports various sophisticated ways of restricting the
result set through functions and boolean expressions, ElementPath only
supports pure path traversal without nesting or further conditions. So, while
the ElementPath syntax is self-contained and therefore easier to write and
handle, XPath is much more powerful and expressive.
lxml.etree bridges this gap through the class ETXPath, which accepts XPath
expressions with namespaces in Clark notation. It is identical to the
XPath class, except for the namespace notation. Normally, you would
write:
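>>> root = etree.XML("<root xmlns='ns'><a><b/></a><b/></root>")
>>> find = etree.XPath("//p:b", namespaces={'p': 'ns'})
>>> print(find(root)[0].tag)
{ns}b
ETXPath allows you to change this to:
>>> find = etree.ETXPath("//{ns}b")
>>> print(find(root)[0].tag)
{ns}b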
Error handling
lxml.etree raises exceptions when errors occur while parsing or evaluating an
XPath expression:
>>> find = etree.XPath("\\")
Traceback (most recent call last):
  ...
lxml.etree.XPathSyntaxError: Invalid expression
lxml will also try to give you a hint what went wrong, so if you pass a more
complex expression, you may get a somewhat more specific error:
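>>> find = etree.XPath("//*[1.1.1]")
Traceback (most recent call last):
  ...
lxml.etree.XPathSyntaxError: Invalid predicate
During evaluation, lxml will emit an XPathEvalError on errors:
>>> find = etree.XPath("//ns:a")
>>> find(root)
Traceback (most recent call last):
  ...
lxml.etree.XPathEvalError: Undefined namespace prefix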
This works for the XPath class, however, the other evaluators (including
the xpath() method) are one-shot operations that do parsing and evaluation
in one step. They therefore raise evaluation exceptions in all cases:
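>>> root = etree.Element("test")
>>> find = root.xpath("//*[1.1.1]")
Traceback (most recent call last):
  ...
lxml.etree.XPathEvalError: Invalid predicate
Note that lxml versions before 1.3 always raised an XPathSyntaxError for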
all errors, including evaluation errors. The best way to support older
versions is to except on the superclass XPathError.
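Catching the superclass might look like the sketch below. The helper name safe_xpath is hypothetical and exists only for this illustration.

```python
from lxml import etree

def safe_xpath(element, expression):
    """Hypothetical helper: return the XPath result, or None on any XPath error."""
    try:
        return element.xpath(expression)
    except etree.XPathError as exc:
        # Covers both XPathSyntaxError and XPathEvalError.
        print("XPath failed: %s" % exc)
        return None

root = etree.XML("<root><a/></root>")
ok = safe_xpath(root, "//a")
bad_predicate = safe_xpath(root, "//*[1.1.1]")
bad_prefix = safe_xpath(root, "//ns:a")
```

Because both error classes derive from XPathError, the same handler works regardless of which evaluator raised the exception.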
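XSLT
lxml.etree introduces a new class, lxml.etree.XSLT. The class can be
given an ElementTree or Element object to construct an XSLT
transformer:
>>> xslt_root = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_root)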
You can then run the transformation on an ElementTree document by simply
calling it, and this results in another ElementTree object:
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result_tree = transform(doc)
By default, XSLT supports all extension functions from libxslt and
libexslt as well as Python regular expressions through the EXSLT
regexp functions. Also see the documentation on custom extension
functions, XSLT extension elements and document resolvers.
There is a separate section on controlling access to external
documents and resources.
XSLT result objects
The result of an XSL transformation can be accessed like a normal ElementTree
document:
>>> root = etree.XML('<a><b>Text</b></a>')
>>> result = transform(root)
>>> result.getroot().text
'Text'
but, as opposed to normal ElementTree objects, can also be turned into an (XML
or text) string by applying the bytes() function (str() in Python 2):
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>Text</foo>\n'
The result is always a plain string, encoded as requested by the xsl:output
element in the stylesheet. If you want a Python Unicode/Text string instead,
you should set this encoding to UTF-8 (unless the ASCII default
is sufficient). This allows you to call the builtin str() function on
the result (unicode() in Python 2):
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
You can use other encodings at the cost of multiple recoding. Encodings that
are not supported by Python will result in an error:
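>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:output encoding="UCS4"/>
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc)
>>> str(result)
Traceback (most recent call last):
  ...
LookupError: unknown encoding: UCS4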
While it is possible to use the .write() method (known from ElementTree
objects) to serialise the XSLT result into a file, it is better to use the
.write_output() method. The latter knows about the <xsl:output> tag
and writes the expected data into the output file.
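>>> xslt_root = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:output method="text" encoding="utf8" />
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_root)
>>> result = transform(doc)
>>> result.write_output("output.txt.gz", compression=9)    # doctest: +SKIP
>>> from io import BytesIO
>>> out = BytesIO()
>>> result.write_output(out)
>>> data = out.getvalue()
>>> b'Text' in data
True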
Stylesheet parameters
It is possible to pass parameters, in the form of XPath expressions, to the
XSLT template:
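>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:param name="a" />
...     <xsl:template match="/">
...         <foo><xsl:value-of select="$a" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> doc_root = etree.XML('<a><b>Text</b></a>')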
The parameters are passed as keyword parameters to the transform call.
First, let’s try passing in a simple integer expression:
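>>> result = transform(doc_root, a="5")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>5</foo>\n'
You can use any valid XPath expression as parameter value:
>>> result = transform(doc_root, a="/a/b/text()")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>Text</foo>\n'
It’s also possible to pass an XPath object as a parameter:
>>> result = transform(doc_root, a=etree.XPath("/a/b/text()"))
Passing a string expression looks like this:
>>> result = transform(doc_root, a="'A'")
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>A</foo>\n'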
To pass a string that (potentially) contains quotes, you can use the
.strparam() class method. Note that it does not escape the
string. Instead, it returns an opaque object that keeps the string
value.
>>> plain_string_value = etree.XSLT.strparam(
...                          """ It's "Monty Python" """)
>>> result = transform(doc_root, a=plain_string_value)
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo> It\'s "Monty Python" </foo>\n'
If you need to pass parameters that are not legal Python identifiers,
pass them inside of a dictionary:
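>>> transform = etree.XSLT(etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:param name="non-python-identifier" />
...     <xsl:template match="/">
...         <foo><xsl:value-of select="$non-python-identifier" /></foo>
...     </xsl:template>
... </xsl:stylesheet>'''))
>>> result = transform(doc_root, **{'non-python-identifier': '5'})
>>> bytes(result)
b'<?xml version="1.0"?>\n<foo>5</foo>\n'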
Errors and messages
Like most of the processing oriented objects in lxml.etree, XSLT
provides an error log that lists messages and error output from the
last run. See the parser documentation for a description of the
error log.
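>>> xslt_root = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <xsl:message terminate="no">STARTING</xsl:message>
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...         <xsl:message terminate="no">DONE</xsl:message>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_root)
>>> result = transform(doc_root)
>>> print(transform.error_log)
<string>:0:0:ERROR:XSLT:ERR_OK: STARTING
<string>:0:0:ERROR:XSLT:ERR_OK: DONE
>>> for entry in transform.error_log:
...     print('message from line %s, col %s: %s' % (
...                entry.line, entry.column, entry.message))
...     print('domain: %s (%d)' % (entry.domain_name, entry.domain))
...     print('type: %s (%d)' % (entry.type_name, entry.type))
...     print('level: %s (%d)' % (entry.level_name, entry.level))
...     print('filename: %s' % entry.filename)
message from line 0, col 0: STARTING
domain: XSLT (22)
type: ERR_OK (0)
level: ERROR (2)
filename: <string>
message from line 0, col 0: DONE
domain: XSLT (22)
type: ERR_OK (0)
level: ERROR (2)
filename: <string>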
Note that there is no way in XSLT to distinguish between user
messages, warnings and error messages that occurred during the
run. libxslt simply does not provide this information. You can
partly work around this limitation by making your own messages
uniquely identifiable, e.g. with a common text prefix.
The xslt() tree method
There’s also a convenience method on ElementTree objects for doing XSL
transformations. This is less efficient if you want to apply the same XSL
transformation to multiple documents, but is shorter to write for one-shot
operations, as you do not have to instantiate a stylesheet yourself:
>>> result = doc.xslt(xslt_tree, a="'A'")
This is a shortcut for the following code:
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc, a="'A'")
Dealing with stylesheet complexity
Some applications require a larger set of rather diverse stylesheets.
lxml.etree allows you to deal with this in a number of ways. Here are
some ideas to try.
The simplest way to reduce the diversity is by using XSLT
parameters that you pass at call time to configure the stylesheets.
The partial() function in the functools module
may come in handy here. It allows you to bind a set of keyword
arguments (i.e. stylesheet parameters) to a reference of a callable
stylesheet. The same works for instances of the XPath()
evaluator, obviously.
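A minimal sketch of this idea follows. The stylesheet, the parameter name greeting and the variable names are all invented for the illustration.

```python
from functools import partial
from lxml import etree

# Made-up stylesheet with a single parameter named "greeting".
transform = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="greeting"/>
  <xsl:template match="/">
    <out><xsl:value-of select="$greeting"/></out>
  </xsl:template>
</xsl:stylesheet>'''))

# Bind different parameter values to the same compiled stylesheet.
say_hello = partial(transform, greeting="'Hello'")  # XPath string literal
say_bye = partial(transform, greeting=etree.XSLT.strparam("Bye"))

# The same approach works for a compiled XPath expression with variables.
count_named = etree.XPath("count(//*[local-name() = $name])")
count_a = partial(count_named, name="a")

doc = etree.XML("<a/>")
hello_text = say_hello(doc).getroot().text
bye_text = say_bye(doc).getroot().text
number_of_a = count_a(doc)
print(hello_text, bye_text, number_of_a)
```

The stylesheet is compiled once; each partial object is only a thin wrapper that supplies the bound parameters on every call.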
You may also consider creating stylesheets programmatically. Just
create an XSL tree, e.g. from a parsed template, and then add or
replace parts as you see fit. Passing an XSL tree into the XSLT()
constructor multiple times will create independent stylesheets, so
later modifications of the tree will not be reflected in the already
created stylesheets. This makes stylesheet generation very
straightforward.
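The following sketch shows one way this could look: a single template tree is modified between two XSLT() calls, and each resulting stylesheet keeps the state the tree had when it was created. The template, element names and variable names are made up for the example.

```python
from lxml import etree

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Made-up template stylesheet that is adapted before each compilation.
template = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out><xsl:value-of select="/a/b"/></out>
  </xsl:template>
</xsl:stylesheet>''')

# Locate the instruction to modify, using Clark notation for the namespace.
value_of = template.find(".//{%s}value-of" % XSL_NS)

value_of.set("select", "/a/b")
pick_b = etree.XSLT(template)   # snapshot of the tree as it is now

value_of.set("select", "/a/c")  # later change, does not affect pick_b
pick_c = etree.XSLT(template)

doc = etree.XML("<a><b>B</b><c>C</c></a>")
b_text = pick_b(doc).getroot().text
c_text = pick_c(doc).getroot().text
print(b_text, c_text)
```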
A third thing to remember is the support for custom extension
functions and XSLT extension elements. Some things are much
easier to express in XSLT than in Python, while for others it is the
complete opposite. Finding the right mixture of Python code and XSL
code can help a great deal in keeping applications well designed and
maintainable.
Profiling
If you want to know how your stylesheet performed, pass the profile_run
keyword to the transform:
>>> result = transform(doc, a="/a/b/text()", profile_run=True)
>>> profile = result.xslt_profile
The value of the xslt_profile property is an ElementTree with profiling
data about each template, similar to the following: