algorithm - Finding groups of similar strings in a large set of strings
I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find these groups reasonably efficiently.
As an example, let's say the input list is on the left below, and the output groups are on the right.
    Input                    Output
    -----------------        -----------------
    Jane Doe                 Mr Philip Roberts
    Mr Philip Roberts        Phil Roberts
    Foo McBar                Philip Roberts
    David Jones
    Phil Roberts             Foo McBar
    Davey Jones        =>
    John Smith               David Jones
    Philip Roberts           Dave Jones
    Dave Jones               Davey Jones
    Jonny Smith
                             Jane Doe

                             John Smith
                             Jonny Smith
Does anybody know of any ways to solve this reasonably efficiently?
The standard method for finding similar strings seems to be Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding whether two strings are in the same group or not.
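For reference, the brute-force approach described above can be sketched as follows. This is only a minimal sketch, not a recommended solution: it compares every pair (O(n²) comparisons, which is manageable for around 100 strings), links any two strings whose edit distance is within a threshold, and merges linked strings into groups with a small union-find. The function names, the case-folding, and the default `max_distance=3` are assumptions made for illustration; the threshold in particular would need tuning on real data.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def group_by_edit_distance(strings, max_distance=3):
    """Link strings within max_distance of each other, then return the
    connected groups (single linkage). max_distance is an assumed value."""
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent pointers to the group representative,
        # shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; case is ignored (an assumption).
    for i, j in combinations(range(len(strings)), 2):
        if levenshtein(strings[i].lower(), strings[j].lower()) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


names = ["Jane Doe", "Mr Philip Roberts", "Foo McBar", "David Jones",
         "Phil Roberts", "Davey Jones", "John Smith", "Philip Roberts",
         "Dave Jones", "Jonny Smith"]
for group in group_by_edit_distance(names):
    print(group)
```

Because groups are formed by chaining links, "Mr Philip Roberts" and "Phil Roberts" can end up together through "Philip Roberts" even though they are more than 3 edits apart. The same chaining means a threshold that is too loose can merge unrelated groups.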
An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers that are close together on the number line. I have no idea what algorithm that would be though, if one even exists.
Does anyone have any thoughts/pointers?
Update: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/jaccard_index.
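To make the idea concrete: the Jaccard index of two sets is the size of their intersection divided by the size of their union, so it ranges from 0 (nothing shared) to 1 (identical sets). The sketch below assumes each string is turned into its set of lowercase character bigrams. That representation is an illustrative choice, not something the method prescribes; word tokens or longer n-grams would also work. The function names are hypothetical.

```python
def bigrams(text: str) -> set:
    """Set of overlapping two-character substrings, lowercased (an assumed
    way of turning a string into a set)."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard_index(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_index(bigrams("Phil Roberts"), bigrams("Philip Roberts")))
print(jaccard_index(bigrams("Phil Roberts"), bigrams("Jane Doe")))
```

The first pair shares most of its bigrams and scores well above the second pair, which shares none. Strings could then be grouped by linking pairs whose index exceeds some chosen cutoff.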
Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours: