Mastering the lxml Python Library for Powerful XML and HTML Parsing
The lxml library is an efficient, feature-rich Python library for processing XML and HTML documents. It combines the speed and XML feature set of the C libraries libxml2 and libxslt with the convenience of a Pythonic API. In this blog post, we will look at lxml, walk through its core APIs, and work through code examples to help you get started. We'll also build a small real-world application with these APIs to see how the pieces fit together in practice.
Key Features of lxml
- Efficient and Pythonic XML/HTML parsing and generation.
- Full support for XPath and XSLT.
- Largely compatible with the standard library's ElementTree API.
- Built on the C libraries libxml2 and libxslt.
Installation
You can install lxml using pip:
pip install lxml
Working with XML
1. Parsing XML Files
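Parse an XML document from a string and iterate over its child elements:
from lxml import etree

xml_data = '''<root>
    <child name="child1"/>
    <child name="child2"/>
</root>'''

# fromstring returns the root element of the parsed document
tree = etree.fromstring(xml_data)

# Iterating over an element yields its direct children
for child in tree:
    print(child.tag, child.attrib)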
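When the XML lives in a file rather than a string, etree.parse is the usual entry point. It returns an ElementTree object, and getroot() gives the root element. Here is a minimal sketch; the file name example.xml is a hypothetical one, written to a temporary directory only so the snippet runs on its own:

```python
import os
import tempfile

from lxml import etree

xml_data = b"""<root>
  <child name="child1"/>
  <child name="child2"/>
</root>"""

# Write a throwaway file so the example is self-contained;
# "example.xml" is a hypothetical name chosen for this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.xml")
    with open(path, "wb") as handle:
        handle.write(xml_data)

    # etree.parse returns an ElementTree wrapping the whole document.
    tree = etree.parse(path)
    root = tree.getroot()
    child_names = [child.get("name") for child in root]

print(root.tag, child_names)
```

In your own code you would skip the temporary file and pass the path of an existing document straight to etree.parse.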
2. Creating XML Documents
Generate XML documents programmatically:
from lxml import etree

root = etree.Element("root")
child1 = etree.SubElement(root, "child", name="child1")
child2 = etree.SubElement(root, "child", name="child2")

print(etree.tostring(root, pretty_print=True).decode("utf-8"))
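Elements can also carry text and gain attributes after they are created. The sketch below sets both and serializes the result with an XML declaration; the attribute name lang and the text value are made up for illustration:

```python
from lxml import etree

root = etree.Element("root")
child = etree.SubElement(root, "child", name="child1")
child.text = "some text"   # text content of <child>
child.set("lang", "en")    # add another attribute after creation

# Passing an encoding together with xml_declaration=True yields bytes
# that start with an XML declaration.
xml_bytes = etree.tostring(
    root, pretty_print=True, xml_declaration=True, encoding="UTF-8"
)
print(xml_bytes.decode("utf-8"))

# Round trip: parsing the bytes back gives an equivalent tree.
reparsed = etree.fromstring(xml_bytes)
```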
3. Using XPath Queries
Extract elements using XPath:
from lxml import etree

xml_data = '''<root>
    <child name="child1"/>
    <child name="child2"/>
</root>'''

tree = etree.fromstring(xml_data)
result = tree.xpath("//child[@name='child1']")
print(result[0].tag, result[0].attrib)
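XPath expressions do not always return elements. The sketch below, which reuses the same sample document, shows three other result shapes: attribute values as strings, a query driven by an XPath variable, and a numeric result from count(). The variable name wanted is an arbitrary choice for this example:

```python
from lxml import etree

xml_data = """<root>
  <child name="child1"/>
  <child name="child2"/>
</root>"""
root = etree.fromstring(xml_data)

# Selecting attributes returns a list of strings instead of elements.
names = root.xpath("//child/@name")

# XPath variables ($wanted here) are passed as keyword arguments,
# which avoids building the expression with string formatting.
matches = root.xpath("//child[@name = $wanted]", wanted="child2")

# XPath functions such as count() return a float.
total = root.xpath("count(//child)")

print(names, matches[0].get("name"), total)
```

Passing values as variables is worth the habit whenever part of the query comes from user input, since quoting inside a hand-built expression is easy to get wrong.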
4. Applying XSLT Transformation
Transform XML using XSLT:
from lxml import etree

xml_data = '''<root>
    <message>Hello World!</message>
</root>'''

xslt_data = '''<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="message">
        <greeting><xsl:value-of select="."/></greeting>
    </xsl:template>
</xsl:stylesheet>'''

xml_tree = etree.fromstring(xml_data)
xslt_tree = etree.fromstring(xslt_data)
transform = etree.XSLT(xslt_tree)
result_tree = transform(xml_tree)
print(etree.tostring(result_tree, pretty_print=True).decode("utf-8"))
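Stylesheets often need values supplied at run time. A compiled etree.XSLT object accepts stylesheet parameters as keyword arguments, and etree.XSLT.strparam wraps a plain Python string so it is passed as a string value rather than evaluated as an XPath expression. A minimal sketch, assuming a hypothetical parameter named who:

```python
from lxml import etree

xslt_data = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="who"/>
  <xsl:template match="/">
    <greeting>Hello, <xsl:value-of select="$who"/>!</greeting>
  </xsl:template>
</xsl:stylesheet>"""

# Compile the stylesheet once; the transform object is reusable.
transform = etree.XSLT(etree.fromstring(xslt_data))
doc = etree.fromstring("<root/>")

# "who" is a hypothetical parameter name declared in the stylesheet above.
result = transform(doc, who=etree.XSLT.strparam("lxml"))
greeting = result.getroot()
print(greeting.text)
```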
Working with HTML
The lxml library makes parsing and manipulating HTML a simple task:
1. Parsing HTML
Use the html module for parsing:
from lxml import html

html_data = '''<html>
    <body><p>Hello, World!</p></body>
</html>'''

tree = html.fromstring(html_data)
paragraph = tree.xpath("//p")[0]
print(paragraph.text)
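Real pages usually nest tags inside paragraphs and contain links you want to collect. The sketch below uses a made-up HTML snippet to show the difference between .text and text_content(), and how to pull link text and href values out of anchor tags:

```python
from lxml import html

# Sample markup invented for this example.
html_data = """<html><body>
  <p>Hello, <b>World</b>!</p>
  <a href="/docs">Docs</a>
  <a href="https://example.com/about">About</a>
</body></html>"""
tree = html.fromstring(html_data)

paragraph = tree.xpath("//p")[0]
# .text stops at the first child element; text_content() also
# includes the text of nested tags.
leading_text = paragraph.text
full_text = paragraph.text_content()

# Collect (link text, href) pairs from every anchor.
links = [(a.text_content(), a.get("href")) for a in tree.xpath("//a")]

print(repr(leading_text), repr(full_text), links)
```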
2. Cleaning HTML
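Remove unwanted tags using lxml.html.clean. Since lxml 5.2 the cleaner is distributed as the separate lxml_html_clean package, which must be installed for this import to work:
from lxml.html.clean import Cleaner

html_data = '''
<div><script>alert("Hello")</script>
<p>Content</p></div>'''

# <script> tags are removed by default; style=True also removes <style> tags
cleaner = Cleaner(javascript=True, style=True)
cleaned_html = cleaner.clean_html(html_data)
print(cleaned_html)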
Real-World Application: XML-to-HTML Converter
Here is an example application that takes XML data and converts it into an HTML table.
from lxml import etree, html

xml_data = '''<root>
    <item><name>Item1</name><price>10</price></item>
    <item><name>Item2</name><price>20</price></item>
</root>'''

tree = etree.fromstring(xml_data)
table = etree.Element("table")

# Add one table row per <item>, with a cell for the name and the price
for item in tree.xpath("//item"):
    row = etree.SubElement(table, "tr")
    name = etree.SubElement(row, "td")
    name.text = item.findtext("name")
    price = etree.SubElement(row, "td")
    price.text = item.findtext("price")

print(html.tostring(table, pretty_print=True).decode("utf-8"))
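The script above hard-codes the two columns. One way to make it reusable is to wrap the logic in a function that takes the column names as an argument and adds a header row. This is only a sketch: the function name xml_items_to_html_table and its parameters are hypothetical, and it assumes the same root/item layout as the sample data:

```python
from lxml import etree, html


def xml_items_to_html_table(xml_text, columns):
    """Build an HTML <table> string from <item> elements.

    Hypothetical helper for illustration; `columns` lists the child
    tag names to turn into table columns.
    """
    root = etree.fromstring(xml_text)
    table = etree.Element("table")

    # Header row with one <th> per column name.
    header = etree.SubElement(table, "tr")
    for column in columns:
        etree.SubElement(header, "th").text = column

    # One <tr> per <item>; a missing child becomes an empty cell.
    for item in root.iterfind("item"):
        row = etree.SubElement(table, "tr")
        for column in columns:
            cell = etree.SubElement(row, "td")
            cell.text = item.findtext(column, default="")

    return html.tostring(table, encoding="unicode")


xml_data = """<root>
  <item><name>Item1</name><price>10</price></item>
  <item><name>Item2</name></item>
</root>"""
table_html = xml_items_to_html_table(xml_data, ["name", "price"])
print(table_html)
```

The second item deliberately has no price, to show that findtext with a default keeps the row shape intact instead of raising an error.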
lxml is a powerful library for handling XML and HTML in Python. With its extensive API and its foundation on libxml2 and libxslt, it is a must-know for developers working with structured data formats. Whether you're parsing website data or transforming XML, lxml has you covered!