xml - Why doesn't XPath work when processing an XHTML document with lxml (in Python)?


I am testing against the following test document:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>hi there</title>
    </head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>

If I parse the document using lxml.html, I can get the img with XPath just fine:

>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]

However, if I parse the document as XML and try to get the img tag, I get an empty result:

>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]

I can navigate to the element directly:

>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>

But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree for an XPath expression that directly identifies this element, which, technically, I can do:

>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]

But that XPath is, again, not useful for parsing arbitrary documents.

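Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces, but the only namespace defined is the default one, and I don't know what else I might need to consider in regard to namespaces.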

So, what am I missing?

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace, since that is the default namespace for the element. You are asking for the img tag in no namespace.

Try this:

>>> tree.getroot().xpath(
...     "//xhtml:img",
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
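
For anyone who wants to reproduce this outside the interactive prompt, here is a minimal self-contained sketch of the same fix. It makes a few assumptions that are not in the question: the document is supplied as bytes through BytesIO (lxml rejects str input that carries an encoding declaration), the DOCTYPE line is left out to keep the sample short, and the names DOC and XHTML_NS are just illustrative. The local-name() query at the end is an alternative that ignores namespaces altogether, not part of the fix above.

```python
from io import BytesIO

from lxml import etree

# Same shape as the test document in the question, minus the DOCTYPE.
# Bytes are used because lxml refuses str input with an encoding declaration.
DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>hi there</title>
    </head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>
"""

XHTML_NS = "http://www.w3.org/1999/xhtml"

tree = etree.parse(BytesIO(DOC))
root = tree.getroot()

# An unprefixed name in XPath 1.0 means "no namespace", so nothing matches.
no_ns = root.xpath("//img")

# Bind a prefix of our own choosing to the XHTML namespace URI.
with_prefix = root.xpath("//xhtml:img", namespaces={"xhtml": XHTML_NS})

# Alternative: match on the local name and ignore the namespace.
by_local_name = root.xpath("//*[local-name() = 'img']")

print(len(no_ns), len(with_prefix), len(by_local_name))
print(with_prefix[0].tag)
print(with_prefix[0].get("src"))
```

The prefix in the query does not have to match anything in the document (which declares no prefix at all); only the namespace URI it is bound to matters.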
