May 2, 2024

Xpath Python Html

Parse HTML via XPath [closed] – Stack Overflow

In .NET, I found this great library, HtmlAgilityPack, that allows you to easily parse non-well-formed HTML using XPath. I’ve used this for a couple years in my sites, but I’ve had to settle for more painful libraries for my Python, Ruby and other projects. Is anyone aware of similar libraries for other languages?
asked Nov 13 ’08 at 1:05 by Tristan Havelick
I’m surprised there isn’t a single mention of lxml. It’s blazingly fast and will work in any environment that allows CPython libraries.
Here’s how you can parse HTML via XPATH using lxml.
>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)
>>> r = tree.xpath('//foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
>>> r = tree.xpath('//bar')
>>> r[0].tag
'bar'
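Since the question is specifically about non-well-formed HTML, it is worth noting that the same parser recovers from broken markup (my example, not part of the original answer):

>>> tree = etree.HTML('<p>one<p>two<table><tr><td>cell')
>>> tree.xpath('//p/text()')
['one', 'two']
>>> tree.xpath('//td/text()')
['cell']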
answered Jan 20 ’11 at 12:24 by Jagtesh Chadha
In python, ElementTidy parses tag soup and produces an element tree, which allows querying using XPath:
>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e = tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 0x...>
answered Nov 14 ’08 at 3:37 by Aaron Maenpaa
The most stable results I’ve had have been using lxml.html’s soupparser. You’ll need to install python-lxml and python-beautifulsoup, then you can do the following:
from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
matches = tree.xpath("./mal[@form='ed']")
answered Feb 25 ’12 at 4:17 by Gareth Davidson
BeautifulSoup is a good Python library for dealing with messy HTML in clean ways.
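As a quick illustration (my sketch, not from the original answer; note that BeautifulSoup searches with its own find API rather than XPath):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# BeautifulSoup repairs the unclosed tags and lets you query the tree.
soup = BeautifulSoup("<p class='greeting'>Hello <b>world", "html.parser")
print(soup.find("p", class_="greeting").get_text())  # "Hello world"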
answered Nov 13 ’08 at 2:32 by Ned Batchelder
It seems the question could be more precisely stated as “How to convert HTML to XML so that XPath expressions can be evaluated against it”.
Here are two good tools:
TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan. This is
a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Taggle is a commercial C++ port of TagSoup.
SgmlReader is a tool developed by Microsoft’s Chris Lovett.
SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result.
Download the zip file including the standalone executable and the full source code:
answered Nov 13 ’08 at 3:57 by Dimitre Novatchev
There is a free C implementation for XML called libxml2, which has some API support for XPath that I have used with great success, and you can specify HTML as the document type being loaded. This has worked for me on some less-than-perfect HTML documents.
For the most part, XPath is most useful when the inbound HTML is properly coded and can be read ‘like an XML document’. You may want to consider using a utility that is specific to this purpose for cleaning up HTML documents. Here is one example:
As far as these XPath tools go, you will likely find that most implementations are actually based on pre-existing C or C++ libraries such as libxml2.
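For Python specifically, lxml is such a wrapper around libxml2; here is a minimal sketch of loading imperfect HTML through it (my illustration, not part of the original answer):

from lxml import etree

# libxml2's HTML parser (exposed through lxml) recovers from imperfect markup.
parser = etree.HTMLParser(recover=True)
root = etree.fromstring("<html><body><p>unclosed <b>tags", parser)
print(root.xpath("//p//text()"))  # ['unclosed ', 'tags']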
answered Nov 14 ’08 at 1:42 by Klathzazt
HTML Scraping – The Hitchhiker’s Guide to Python
Web Scraping
Web sites are written using HTML, which means that each web page is a
structured document. Sometimes it would be great to obtain some data from
them and preserve the structure while we’re at it. Web sites don’t always
provide their data in comfortable formats such as CSV or JSON.
This is where web scraping comes in. Web scraping is the practice of using a
computer program to sift through a web page and gather the data that you need
in a format most useful to you while at the same time preserving the structure
of the data.
lxml and Requests
lxml is a pretty extensive library written for parsing
XML and HTML documents very quickly, even handling messed up tags in the
process. We will also be using the
Requests module instead of the
already built-in urllib2 module due to improvements in speed and readability.
You can easily install both using pip install lxml and
pip install requests.
Let’s start with the imports:
from lxml import html
import requests
Next we will use requests.get to retrieve the web page with our data,
parse it using the html module, and save the results in tree:
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
(We need to use page.content rather than page.text because
html.fromstring implicitly expects bytes as input.)
tree now contains the whole HTML file in a nice tree structure which
we can go over two different ways: XPath and CSSSelect. In this example, we
will focus on the former.
XPath is a way of locating information in structured documents such as
HTML or XML documents. A good introduction to XPath is on
W3Schools.
There are also various tools for obtaining the XPath of elements such as
FireBug for Firefox or the Chrome Inspector. If you’re using Chrome, you
can right click an element, choose ‘Inspect element’, highlight the code,
right click again, and choose ‘Copy XPath’.
After a quick analysis, we see that in our page the data is contained in
two elements – one is a div with title ‘buyer-name’ and the other is a
span with class ‘item-price’:
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>
Knowing this we can create the correct XPath query and use the lxml
xpath function like this:
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
Let’s see what we got exactly:
print('Buyers: ', buyers)
print('Prices: ', prices)
Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
'$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
'$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
'$15.00', '$114.07', '$10.09']
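For comparison, the same elements can also be selected with CSS selectors through lxml's cssselect method (my addition, not part of the guide; it requires pip install cssselect):

# CSS-selector equivalents of the two XPath queries above.
buyer_divs = tree.cssselect('div[title="buyer-name"]')
price_spans = tree.cssselect('span.item-price')
buyers_css = [div.text_content() for div in buyer_divs]
prices_css = [span.text_content() for span in price_spans]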
Congratulations! We have successfully scraped all the data we wanted from
a web page using lxml and Requests. We have it stored in memory as two
lists. Now we can do all sorts of cool stuff with it: we can analyze it
using Python or we can save it to a file and share it with the world.
Some more cool ideas to think about are modifying this script to iterate
through the rest of the pages of this example dataset, or rewriting this
application to use threads for improved speed.
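For instance, here is a minimal sketch of the threaded variant using the standard library (my addition; the list of URLs is hypothetical):

import concurrent.futures

import requests
from lxml import html

# Hypothetical list of pages from the example dataset.
urls = ['http://econpy.pythonanywhere.com/ex/001.html']

def scrape(url):
    # Download and parse one page, returning its buyer names.
    page = requests.get(url)
    tree = html.fromstring(page.content)
    return tree.xpath('//div[@title="buyer-name"]/text()')

# Fetch the pages in parallel with a small thread pool.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for buyers in pool.map(scrape, urls):
        print(buyers)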
Python Parse Html Page With XPath Example
Python can be used to write a web page crawler to download web pages. But the web page content is massive and not clear for us to use; we need to filter out the useful data that we need. This article will tell you how to parse the downloaded web page content and filter out the information you need using the python lxml library’s xpath method. When it comes to string content filtering, we immediately think about regular expressions, but we won’t talk about regular expressions today, because regular expressions are too complex for a crawler written by a novice. Moreover, the error tolerance of regular expressions is poor, so if the web page changes slightly, the matching expression will have to be rewritten. Fortunately, Python provides many libraries for parsing HTML pages, such as Bs4 BeautifulSoup and Etree in LXML (an XPath parser library). BeautifulSoup looks like a jQuery selector: it looks for Html elements through the id, CSS selector, and tag. Etree’s Xpath method looks for elements primarily through nested relationships of HTML nodes, similar to the path of a file. Below is an example of using Xpath to find Html nodes.
# Gets all tr tags under the table tag with id account
path = '//table[@id="account"]//tr'
1. LXML Installation and Usage
1.1 Install the LXML library
pip install lxml
1.2 Lxml Xpath Usage
Before using XPath, you need to import the etree class and use this class to process the original Html page content to get an _Element object. Then use its xpath method to get related node values.
# Import the etree class from lxml.
from lxml import etree
# Example html content.
html = '''
<div>
  <p><a href="#">I love xpath</a></p>
</div>
'''
# Use etree to process the html text and return an _Element object, which is a dom object.
dom = etree.HTML(html)
# Get the a tag's text. Please note: the _Element's xpath method always returns a list of html nodes. Because there is only one a tag's text here, we can do as below.
a_tag_text = dom.xpath('//div/p/a/text()')
print(a_tag_text)
Save the above code in a file and run it with python3; below is the execution result.
['I love xpath']
2. Xpath Syntax
a/b: / represents a hierarchical relationship in XPath. a on the left is the parent node, b on the right is the child node, and b here is a direct child of a.
a//b: A double / means that all b nodes under the a node should be selected (no matter whether they are direct children or not). So we can also write the above example XPath as //div//a/text().
[@]: Select html nodes by a tag’s attributes. //div[@class]: select div nodes that have a class attribute. //a[@x]: select a nodes that have an x attribute. //div[@class="container"]: select div nodes whose class attribute value is ‘container’.
//a[contains(text(), "love")]: Select a tags whose text content contains the string ‘love’.
//a[contains(@href, "user_name")]: Select a tags whose href attribute value contains ‘user_name’.
//div[contains(@y, "x")]: Select div tags that have a y attribute whose value contains ‘x’.
3. Question & Answer
3.1 How to use the python lxml module to parse out URL addresses in a web page?
In my python script, I use the requests module’s get method to retrieve web content with the page URL. Then I use the python lxml library’s html module to parse the web page content to a dom tree. My question is how to parse out the URL addresses from the dom tree. Below is my source code.
# Import the python requests module.
import requests
# Import the html module from the lxml library.
from lxml import html
# Define the web page url.
web_page_url = ''
# Get the web page content by its url with the requests module's get() method.
web_page = requests.get(web_page_url)
# Get the web page content data.
web_page_content = web_page.content
# Get the web page dom tree with the html module's fromstring() function.
dom_tree = html.fromstring(web_page_content)
Below is the source code that can read a web page by its URL and parse out all the html a tag link URLs on the web page.
# Import the python requests module.
import requests
# Import the etree module from the lxml library.
from lxml import etree
# Import the StringIO class from the io package.
from io import StringIO

def parse_html_url_link(page_url):
    # Create an instance of the etree.HTMLParser class.
    html_parser = etree.HTMLParser()
    # Use the python requests module's get method to get the web page object with the provided url.
    web_page = requests.get(page_url)
    # Convert the web page bytes content to a text string with the decode method.
    web_page_html_string = web_page.content.decode("utf-8")
    # Create a StringIO object with the above web page html string.
    str_io_obj = StringIO(web_page_html_string)
    # Create an etree dom tree object.
    dom_tree = etree.parse(str_io_obj, parser=html_parser)
    # Get all a tag elements in a list.
    a_tag_list = dom_tree.xpath("//a")
    # Loop through the html a tag list.
    for a in a_tag_list:
        # Get each a tag's href attribute value; the value saves the a tag URL link.
        url = a.get('href')
        # Print out the parsed-out URL.
        print(url)

if __name__ == '__main__':
    # The target URL was elided in the original post; pass the page you want to parse.
    parse_html_url_link("")
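To make the XPath syntax rules in section 2 concrete, here is a small self-contained check (my sketch, not part of the original post):

from lxml import etree

doc = etree.HTML('<div class="container"><a href="/u/user_name">I love xpath</a></div>')
# //div[@class="container"]: div nodes whose class attribute equals "container".
print(len(doc.xpath('//div[@class="container"]')))          # 1
# //a[contains(text(), "love")]: a tags whose text contains "love".
print(doc.xpath('//a[contains(text(), "love")]/text()'))    # ['I love xpath']
# //a[contains(@href, "user_name")]: a tags whose href contains "user_name".
print(len(doc.xpath('//a[contains(@href, "user_name")]')))  # 1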

Frequently Asked Questions about xpath python html

Can you parse HTML with XPath?

In .NET, I found this great library, HtmlAgilityPack, that allows you to easily parse non-well-formed HTML using XPath. I’ve used this for a couple years in my … (Jan 20, 2011)

How do I get HTML data from python?

To scrape a website using Python, you need to perform these four basic steps: sending an HTTP GET request to the URL of the webpage that you want to scrape, which will respond with HTML content; fetching and parsing the data using BeautifulSoup and maintaining the data in some data structure such as a dict or list. … (Dec 19, 2019)
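A minimal sketch of those steps (my illustration; the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP GET request (placeholder URL).
response = requests.get("https://example.com")
# Step 2: fetch and parse the HTML content.
soup = BeautifulSoup(response.content, "html.parser")
# Step 3: maintain the data in a simple structure such as a list.
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]
print(headings)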

How do I get text from HTML to Python?

How to extract text from an HTML file in Python:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://kite.com"
html = urlopen(url).read()
soup = BeautifulSoup(html)
for script in soup(["script", "style"]):
    script.decompose()  # delete script and style tags
strips = list(soup.stripped_strings)
print(strips[:5])  # print the start of the list
