html - Python: Detecting the actual text paragraphs in a string
The big mission: I am trying to get a few lines of summary of a webpage, i.e. I want to have a function that takes a URL and returns an informative paragraph from that page. (This would usually be the first paragraph of actual content text, in contrast to "junk text" such as the navigation bar.)
So I managed to reduce an HTML page to a bunch of text by cutting out the tags and throwing out the <head> and the scripts. But some of the text is still "junk text". I want to know where the actual paragraphs of text begin. (Ideally it should be human-language-agnostic, but if you have a solution only for English, that might help too.)
How can I figure out which of the text is "junk text" and which is actual content?
UPDATE: I see some people have pointed me to use an HTML parsing library. I am already using Beautiful Soup. My problem isn't parsing HTML; I have already got rid of the HTML tags, so I just have a bunch of text and I want to separate the content text from the junk text.
You could use the approach outlined on the AI Depot blog, together with some Python code.
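As a rough starting point, here is a minimal sketch of a much simpler heuristic. It is not the AI Depot method; it only assumes that real paragraphs tend to be longer than navigation text and contain sentence punctuation. The function names and thresholds (`looks_like_paragraph`, `first_content_paragraph`, `min_words=15`) are made up for illustration, and it assumes your tag-stripped text keeps blank lines between blocks.

```python
import re


def looks_like_paragraph(block, min_words=15, min_sentence_marks=1):
    """Hypothetical heuristic: the block is long enough and has sentence punctuation."""
    words = block.split()
    if len(words) < min_words:
        return False
    # Count '.', '!' or '?' followed by whitespace or the end of the block.
    marks = len(re.findall(r"[.!?](?:\s|$)", block))
    return marks >= min_sentence_marks


def first_content_paragraph(text):
    """Return the first block that passes the heuristic, or None if none does."""
    # Assumption: blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        candidate = " ".join(block.split())  # collapse internal whitespace
        if looks_like_paragraph(candidate):
            return candidate
    return None


sample = (
    "Home | About | Contact\n\n"
    "Login Register\n\n"
    "Python is a programming language that lets you work quickly and "
    "integrate systems more effectively. It is widely used for web "
    "scraping and text processing tasks.\n\n"
    "Copyright 2010"
)
print(first_content_paragraph(sample))
```

Counting words with `split()` and looking for `.`, `!`, `?` works for many languages written with spaces and Latin punctuation, but not for all of them, so expect to tune the thresholds or swap in a different signal for your pages.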