html - Python: Detecting the actual text paragraphs in a string


The big mission: I'm trying to get a few lines of summary of a webpage, i.e. I want to have a function that takes a URL and returns the most informative paragraph from that page. (Which would usually be the first paragraph of actual content text, in contrast to "junk text", like the navigation bar.)

So I managed to reduce an HTML page to a bunch of text by cutting out the tags and throwing out the <head> and all the scripts. But some of the text is still "junk text". I want to know where the actual paragraphs of text begin. (Ideally it should be human-language-agnostic, but if you only have a solution for English, that might help too.)

How can I figure out which of the text is "junk text" and which is actual content?

Update: I see some people have pointed me to use an HTML parsing library. I am using Beautiful Soup. My problem isn't parsing HTML; I already got rid of all the HTML tags, I just have a bunch of text and I want to separate the content text from the junk text.

You could use the approach outlined at the AI Depot blog, along with the accompanying Python code.
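In case it helps while the linked code isn't at hand, here is a minimal sketch of a much simpler heuristic. To be clear, this is not the code from the AI Depot post; it is a swapped-in rule of thumb. It assumes the stripped text still has blank lines between blocks, and it treats a block as a "real" paragraph when it has enough words and enough sentence-ending punctuation. The names `looks_like_paragraph` and `first_content_paragraph` and both thresholds are made up for illustration and would need tuning on your own pages.

```python
import re

# Hypothetical thresholds, not taken from any library or from the linked post.
MIN_WORDS = 20          # navigation bars and footers are usually shorter
MIN_SENTENCE_ENDS = 2   # real prose tends to contain several full sentences

# A sentence end: ., ! or ? followed by whitespace or the end of the block.
SENTENCE_END = re.compile(r"[.!?](?:\s|$)")


def looks_like_paragraph(block, min_words=MIN_WORDS,
                         min_sentence_ends=MIN_SENTENCE_ENDS):
    """Return True if the block is long enough and punctuated like prose."""
    if len(block.split()) < min_words:
        return False
    return len(SENTENCE_END.findall(block)) >= min_sentence_ends


def first_content_paragraph(text, **thresholds):
    """Return the first block that looks like prose, or None if none does."""
    # Assumption: blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        block = " ".join(block.split())  # collapse internal whitespace
        if looks_like_paragraph(block, **thresholds):
            return block
    return None
```

You would call it as `first_content_paragraph(page_text)` on the text you already extracted with Beautiful Soup. It only relies on whitespace and punctuation, so it should carry over to most languages written with spaces and Latin-style punctuation, but it will not work as-is for languages without spaces between words.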

