html - Issue with regular expressions in Python


OK, I'm working on a regular expression to search out header information in a site.

I've compiled the regular expression:

    regex = re.compile(r'''
        <h[0-9]>\s?
        (<a[ ]href="[A-Za-z0-9.]*">)?\s?
        [A-Za-z0-9.,:'"=/?;\s]*\s?
        [A-Za-z0-9.,:'"=/?;\s]?
        ''', re.X)

When I run this in a Python regex tester, it works out wonderfully.

Sample data:

    <body>
        <h1>dog </h1>
        <h2>cat </h2>
        <h3>fancy </h3>
        <h1>tall cup of lemons</h1>
        <h1><a href="dog.com">dog thing</a></h1>
    </body>

Now, in redemo, it works wonderfully.

When I put it in my Python code, however, it only prints <a href="dog.com">

Here's my Python code. I'm not sure if I'm doing something wrong or if something is lost in translation. I'd appreciate your help.

    # Python 2 code, as posted (urllib2, print statement).
    import re
    import urllib2

    stories = []
    response = urllib2.urlopen('http://apricotclub.org/duh.html')
    html = response.read().lower()
    p = re.compile('<h[0-9]>\\s?(<a href=\"[A-Za-z0-9.]*\">)?\\s?[A-Za-z0-9.,:\'\"=/?;\\s]*\\s?[A-Za-z0-9.,:\'\"=/?;\\s]?')
    stories = re.findall(p, html)
    for i in stories:
        if len(i) >= 5:
            print i

I should note that when I take (<a href=\"[A-Za-z0-9.]*\">)? out of the regular expression, it works fine for the non-link <hN> lines.
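For what it's worth, the symptom described above is consistent with how `re.findall` is documented to behave: when a pattern contains exactly one capturing group, it returns what that group matched rather than the whole match, and an empty string where the group did not take part. The sketch below is an illustration only. It is written for Python 3 (the posted code is Python 2), and it uses a shortened, hypothetical version of the pattern and an inline string instead of the real page.

```python
import re

# Inline stand-in for the downloaded page (assumption: already lowercased).
html = '<h1>dog </h1> <h1><a href="dog.com">dog thing</a></h1>'

# Simplified pattern with ONE capturing group, as in the question.
capturing = re.compile(r'<h[0-9]>\s?(<a href="[a-z0-9.]*">)?\s?[a-z0-9 ]*')

# Same pattern, but the group is non-capturing: (?:...)
non_capturing = re.compile(r'<h[0-9]>\s?(?:<a href="[a-z0-9.]*">)?\s?[a-z0-9 ]*')

# With a capturing group, findall returns only the group's text per match.
group_only = capturing.findall(html)

# Without capturing groups, findall returns the full matched text.
whole_match = non_capturing.findall(html)

print(group_only)   # ['', '<a href="dog.com">']
print(whole_match)  # ['<h1>dog ', '<h1><a href="dog.com">dog thing']
```

In the first list the plain heading yields an empty string, which the `len(i) >= 5` check in the posted loop would then filter out, leaving only the link tag.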

This question has been asked in several forms over the last few days, so I'm going to say this very clearly.

Q: How do I parse HTML with regular expressions?

A: Please don't.

Use BeautifulSoup, html5lib or lxml.html. Please.
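As a rough sketch of what the parser-based approach looks like, the example below collects heading text from the sample data in the question. Note the substitution: it uses the standard library's `html.parser` (Python 3) rather than any of the three libraries named above, purely so that it runs without installing anything. The class name `HeadingCollector` and its attributes are made up for this illustration. The libraries above are more forgiving of malformed markup and would likely need less code.

```python
from html.parser import HTMLParser

HEADING_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class HeadingCollector(HTMLParser):
    """Hypothetical helper: gathers the text inside <h1>..<h6> elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # list of text chunks while inside a heading

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; nested tags such as <a> are ignored.
        if tag in HEADING_TAGS:
            self._current = []

    def handle_data(self, data):
        # Keep text only while a heading is open (including text inside links).
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._current is not None:
            self.headings.append(''.join(self._current).strip())
            self._current = None


sample = """<body>
    <h1>dog </h1>
    <h2>cat </h2>
    <h3>fancy </h3>
    <h1>tall cup of lemons</h1>
    <h1><a href="dog.com">dog thing</a></h1>
</body>"""

collector = HeadingCollector()
collector.feed(sample)
collector.close()
headings = collector.headings

print(headings)
# ['dog', 'cat', 'fancy', 'tall cup of lemons', 'dog thing']
```

The linked heading needs no special case here: the parser reports the `<a>` tag separately, so its text is collected like any other heading text.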

