xml - Why doesn't XPath work when processing an XHTML document with lxml (in Python)?
I am testing against the following test document:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>hi there</title>
  </head>
  <body>
    <img class="foo" src="bar.png"/>
  </body>
</html>
If I parse the document using lxml.html, I can get the img element with XPath just fine:
>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]
However, if I parse the document as XML and try to get the img tag, I get an empty result:
>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]
I can navigate to the element directly:
>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>
But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree for an XPath expression that directly identifies this element, which, technically, I can do:
>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]
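But that XPath is, again, not useful for parsing arbitrary documents.
Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces, but the only namespace defined is the default one, and I don't know what else I might need to consider in regards to namespaces.
So, what am I missing?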
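The problem is the namespaces. When the document is parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace, since that is the default namespace declared on the html element and inherited by its descendants. An unprefixed name in an XPath expression always means "no namespace", so //img asks for an img tag in no namespace and matches nothing.
Try this: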
>>> tree.getroot().xpath(
...     "//xhtml:img",
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
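For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch of the same idea. A few things in it are my own assumptions rather than part of the question: the document is a trimmed copy of the test document (the DOCTYPE is left out), the names XHTML_NS and NSMAP are made up for illustration, and the prefix "xhtml" is arbitrary, since only the namespace URI has to match. The first part uses only find-style queries that lxml shares with the standard library's xml.etree.ElementTree, so it should also run where lxml is not installed. The last part is guarded because xpath() is lxml-only.

```python
from io import BytesIO

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Fallback (an assumption for portability): the stdlib API is close
    # enough for the find-style queries below, but it has no .xpath().
    import xml.etree.ElementTree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
NSMAP = {"xhtml": XHTML_NS}  # prefix is arbitrary; only the URI matters

# Trimmed copy of the test document, as bytes because it declares an encoding.
doc = b"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>hi there</title></head>
  <body><img class="foo" src="bar.png"/></body>
</html>"""

root = etree.parse(BytesIO(doc)).getroot()

# The default namespace is folded into every tag name (Clark notation).
print(root.tag)

# An unprefixed name means "no namespace", so this finds nothing.
unqualified = root.findall(".//img")

# Binding a prefix to the XHTML URI makes the name test match.
qualified = root.findall(".//xhtml:img", namespaces=NSMAP)

# Clark notation spells out the URI and needs no prefix mapping.
clark = root.findall(".//{%s}img" % XHTML_NS)

print(len(unqualified), len(qualified), len(clark))
print(qualified[0].get("src"))

# lxml only: full XPath 1.0, including a namespace-agnostic match
# on the local part of the element name.
if hasattr(root, "xpath"):
    print(root.xpath("//xhtml:img", namespaces=NSMAP))
    print(root.xpath("//*[local-name() = 'img']"))
```

The local-name() form is handy for one-off queries against documents whose namespace you do not control, but it matches an img element in any namespace, so the explicit prefix mapping is usually the safer choice.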