html - Python: Detecting the actual text paragraphs in a string

- March 15, 2013

The big mission: I'm trying to get a few lines of summary of a webpage, i.e. I want to have a function that takes a URL and returns the most informative paragraph from that page. (Which would probably be the first paragraph of actual content text, in contrast to "junk text" like the navigation bar.)

So I managed to reduce an HTML page to a bunch of text by cutting out the tags and throwing out the <head> and all the scripts. But some of the text is still "junk text". I want to know where the actual paragraphs of text begin. (Ideally it should be human-language-agnostic, but if you have a solution only for English, that might help too.)
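
For reference, the tag-stripping step described above could look roughly like the sketch below. It uses only the standard library's `html.parser` instead of Beautiful Soup so that it stays self-contained; the names `TextExtractor` and `html_to_text` are made up for illustration and are not from the original post.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <head>, <script>, and <style> content."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Note that the output of something like this still contains navigation links, footers and so on mixed in with the real content, which is exactly the problem I'm asking about.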

How can I figure out which of the text is "junk text" and which is actual content?

Update: I see some people have pointed me to use an HTML parsing library. I am already using Beautiful Soup. My problem isn't parsing the HTML; I already got rid of all the HTML tags. I just have a bunch of text, and I want to separate the content text from the junk text.

You could use the approach outlined at the AI Depot blog, along with the Python code:

The Easy Way to Extract Useful Text from Arbitrary HTML
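
As far as I understand it, the core idea there is to score each line of the raw HTML by how much of it is text and how much is markup, then keep the dense lines. Below is a minimal sketch of that idea only, not the article's actual code. The function names and the `min_density` and `min_chars` thresholds are assumptions picked for illustration, and the regex tag stripping is deliberately crude. Note that it works on the HTML before the tags are removed, since the tags are what the score is based on.

```python
import re

# Crude tag matcher; good enough for a density estimate, not for real parsing.
TAG_RE = re.compile(r"<[^>]*>")


def text_density(line):
    """Fraction of a line's characters that remain after tags are removed."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def dense_lines(html, min_density=0.5, min_chars=40):
    """Return the text of lines that are long enough and mostly text.

    Both thresholds are arbitrary starting points (assumptions), to be
    tuned against real pages.
    """
    results = []
    for line in html.splitlines():
        text = TAG_RE.sub("", line).strip()
        if len(text) >= min_chars and text_density(line) >= min_density:
            results.append(text)
    return results
```

Navigation bars tend to be short pieces of text wrapped in a lot of markup, so they score low, while body paragraphs score high. Since the score only counts characters, it does not depend on the language of the page.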
