algorithm - Finding groups of similar strings in a large set of strings


I have a reasonably large set of strings (say 100) that has a number of subgroups characterised by their similarity. I am trying to find or design an algorithm that would find these groups reasonably efficiently.

As an example, let's say I have the input list on the left below, and the output groups on the right.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts
Foo McBar                       Philip Roberts
David Jones
Phil Roberts                    Foo McBar
Davey Jones            =>
John Smith                      David Jones
Philip Roberts                  Dave Jones
Dave Jones                      Davey Jones
Jonny Smith
                                Jane Doe

                                John Smith
                                Jonny Smith

Does anybody know of any ways to solve this reasonably efficiently?

The standard method for finding similar strings seems to be Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding whether two strings are in the same group or not.
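For concreteness, the brute-force approach described above could be sketched as follows. For about 100 strings the all-pairs comparison is only 4,950 distance computations, so the cost is mostly a concern for much larger sets. The function names, the length normalisation, the single-linkage merging, and the `threshold` default are all assumptions made for illustration; the threshold in particular would need tuning against real data.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_by_edit_distance(strings, threshold=0.3):
    """Single-linkage grouping: two strings are linked when their edit
    distance, divided by the longer length, is at most `threshold`
    (an assumed, tunable value). Linked strings end up in one group."""
    parent = list(range(len(strings)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        a, b = strings[i].lower(), strings[j].lower()
        distance = levenshtein(a, b) / max(len(a), len(b), 1)
        if distance <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```

Because linking is transitive here, a chain of small differences can pull fairly different strings into one group, which is the price of not having to pick group centres up front.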

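An alternative would be an algorithm that hashes strings down to an integer, such that similar strings hash to integers that are close together on the number line. I have no idea what algorithm that would be though, if one even exists.

Does anybody have any thoughts or pointers?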
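One family of techniques that comes close to this idea is locality-sensitive hashing. SimHash is an example, with an important caveat: similar inputs tend to produce fingerprints that differ in few bits (small Hamming distance), not integers that are necessarily close on the number line. The sketch below is a minimal illustration only; the choice of character trigrams as features, SHA-256 as the per-feature hash, and the 64-bit width are assumptions, not something the question specifies.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint built from the character trigrams of `text`.

    `bits` is assumed to be a multiple of 8 and at most 256, because the
    per-feature hash is taken from the leading bytes of a SHA-256 digest.
    """
    text = text.lower()
    # Strings shorter than 3 characters contribute a single feature.
    features = {text[i:i + 3] for i in range(max(len(text) - 2, 1))}
    weights = [0] * bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Candidate pairs would then be those whose fingerprints lie within some small Hamming distance of each other, a cut-off that again has to be chosen empirically.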


Update: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point, I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/jaccard_index.

Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/
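As a rough illustration of the Jaccard approach (not taken from the linked article), each string can be turned into a set of character bigrams, and two strings compared by the size of the intersection of their sets divided by the size of the union. The bigram features, the greedy grouping strategy, and the `threshold` default are assumptions chosen to keep the sketch short.

```python
def bigrams(text: str) -> set:
    """Set of character 2-grams of the lower-cased text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_jaccard(strings, threshold=0.4):
    """Greedy grouping: each string joins the first existing group that
    contains a member with similarity >= `threshold` (an assumed value),
    otherwise it starts a new group."""
    shingles = {s: bigrams(s) for s in strings}
    groups = []
    for s in strings:
        for group in groups:
            if any(jaccard(shingles[s], shingles[m]) >= threshold
                   for m in group):
                group.append(s)
                break
        else:
            # No existing group was similar enough.
            groups.append([s])
    return groups
```

For example, "night" and "nights" share four of five distinct bigrams (similarity 0.8), while "night" and "nacht" share only one of seven, so a threshold of 0.5 keeps the first pair together and the second apart. The greedy pass depends on input order, so a different ordering of the list can produce different groups.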

