html - Issue with regular expressions in Python
OK, I'm working on a regular expression to search out header information in a site.
I've compiled this regular expression:
regex = re.compile(r'''
    <h[0-9]>\s?
    (<a[ ]href="[A-Za-z0-9.]*">)?\s?
    [A-Za-z0-9.,:'"=/?;\s]*\s?
    [A-Za-z0-9.,:'"=/?;\s]?
    ''', re.X)
When I run this in a Python regex tester, it works out wonderfully.
Sample data:
<body>
<h1>dog </h1>
<h2>cat </h2>
<h3>fancy </h3>
<h1>tall cup of lemons</h1>
<h1><a href="dog.com">dog thing</a></h1>
</body>
Now, in redemo, it works wonderfully.
When I put it in my Python code, however, it only prints <a href="dog.com">
Here's my Python code. I'm not sure if I'm doing something wrong or if something is lost in translation. I'd appreciate any help.
import re
import urllib2  # Python 2 standard library

stories = []
response = urllib2.urlopen('http://apricotclub.org/duh.html')
html = response.read().lower()
p = re.compile('<h[0-9]>\\s?(<a href=\"[A-Za-z0-9.]*\">)?\\s?[A-Za-z0-9.,:\'\"=/?;\\s]*\\s?[A-Za-z0-9.,:\'\"=/?;\\s]?')
stories = re.findall(p, html)
for i in stories:
    # print only results of at least 5 characters
    if len(i) >= 5:
        print i
I should note that when I take out (<a href=\"[A-Za-z0-9.]*\">)? from the regular expression, it works fine for non-link <hN> lines.
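For what it's worth, the symptom described above is consistent with how re.findall is documented to behave: when the pattern contains a capturing group, it returns the text of the group rather than the whole match. The sketch below is not the asker's exact script. It assumes Python 3, uses the sample data from the question in place of the fetched page, and the names `capturing`, `non_capturing`, `group_results` and `match_results` are made up for illustration.

```python
import re

# Sample data from the question, used instead of fetching the page.
html = ('<body> <h1>dog </h1> <h2>cat </h2> <h3>fancy </h3> '
        '<h1>tall cup of lemons</h1> '
        '<h1><a href="dog.com">dog thing</a></h1> </body>')

# The question's pattern: one capturing group around the optional link.
capturing = re.compile(
    r'<h[0-9]>\s?(<a href="[A-Za-z0-9.]*">)?\s?'
    r'''[A-Za-z0-9.,:'"=/?;\s]*\s?[A-Za-z0-9.,:'"=/?;\s]?''')

# Identical, except (?:...) makes the group non-capturing.
non_capturing = re.compile(
    r'<h[0-9]>\s?(?:<a href="[A-Za-z0-9.]*">)?\s?'
    r'''[A-Za-z0-9.,:'"=/?;\s]*\s?[A-Za-z0-9.,:'"=/?;\s]?''')

# With a capturing group, findall returns only the group's text
# (an empty string wherever the optional group did not take part).
group_results = capturing.findall(html)

# Without a capturing group, findall returns the whole match.
match_results = non_capturing.findall(html)

print(group_results)
print(match_results)
```

With the `len(i) >= 5` filter from the question, only the non-empty group text would survive, which matches the single <a href="dog.com"> being printed. Switching to a non-capturing group returns whole matches, but it does not make the regex a reliable way to read HTML.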
This question has been asked in several forms over the last few days, so I'm going to say this very clearly.
Q: How do I parse HTML with regular expressions?
A: Please don't.
Use BeautifulSoup, html5lib or lxml.html. Please.
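As a rough illustration of the parser approach, here is a minimal sketch that pulls the heading text out of the sample data. It deliberately swaps in Python 3's standard-library html.parser rather than BeautifulSoup, html5lib or lxml.html, purely so it runs without installing anything. The `HeadingCollector` class and the `sample` variable are hypothetical names, not part of any of those libraries.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collect the text inside <h1>..<h6> elements."""

    HEADINGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading texts, in document order
        self._parts = None   # text pieces of the open heading, or None

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a heading opens.
        if tag in self.HEADINGS:
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a heading (nested <a> text included).
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        # Close the heading and store its stripped text.
        if tag in self.HEADINGS and self._parts is not None:
            self.headings.append(''.join(self._parts).strip())
            self._parts = None


# Sample data from the question.
sample = ('<body> <h1>dog </h1> <h2>cat </h2> <h3>fancy </h3> '
          '<h1>tall cup of lemons</h1> '
          '<h1><a href="dog.com">dog thing</a></h1> </body>')

collector = HeadingCollector()
collector.feed(sample)
collector.close()
headings = collector.headings
print(headings)
```

The link heading needs no special case here: the parser reports the <a> tag and its text as separate events, so the heading text is collected the same way whether or not it is wrapped in a link.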