SQL Server - Data Comparison
We have a SQL Server table containing company name, address, and contact name (among others).
We regularly receive data files from outside sources that we are required to match against this table. Unfortunately, the data is slightly different since it is coming from a different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example: we have "Acme, LLC" and the file contains "Acme Inc.". Another: we have "Ed Smith" and they have "Edward Smith".
We have a legacy system that utilizes some rather intricate and CPU-intensive methods for handling these matches. Some involve pure SQL and others involve VBA code in an Access database. The current system is not perfect, and it is cumbersome and difficult to maintain.
The management here wants to expand its use. The developers who will inherit support of the system want to replace it with a more agile solution that requires less maintenance.
Is there a commonly accepted way of dealing with this kind of data matching?
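One common first step for variations like the address and company examples in the question is to normalize both sides before comparing them. The following is only a minimal sketch in Python, not a complete solution: the lookup tables and the function names `normalize_address` and `normalize_company` are hypothetical, and a real system would need much larger, reviewed lists (for instance, "st" can also mean "saint").

```python
import re

# Hypothetical lookup tables for illustration only; real data would
# need far more entries and some review of ambiguous abbreviations.
ADDRESS_ABBREVIATIONS = {
    "e": "east", "w": "west", "n": "north", "s": "south",
    "st": "street", "ave": "avenue", "rd": "road", "blvd": "boulevard",
}
COMPANY_SUFFIXES = {"llc", "inc", "corp", "co", "ltd"}


def _tokens(text):
    """Lowercase the text and split it into alphanumeric tokens,
    which also discards punctuation such as '.' and ','."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize_address(text):
    """Expand known abbreviations token by token."""
    return " ".join(ADDRESS_ABBREVIATIONS.get(tok, tok) for tok in _tokens(text))


def normalize_company(text):
    """Drop legal-form suffixes such as 'LLC' or 'Inc.'."""
    return " ".join(tok for tok in _tokens(text) if tok not in COMPANY_SUFFIXES)


if __name__ == "__main__":
    print(normalize_address("123 E. Main St."))
    print(normalize_address("123 East Main Street"))
    print(normalize_company("Acme, LLC"), "|", normalize_company("Acme Inc."))
```

Normalization of this kind does not help with "Ed Smith" versus "Edward Smith"; that case needs either a nickname table or a fuzzy comparison such as an edit distance.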
Here's something I wrote for a nearly identical stack (we needed to standardize manufacturer names for hardware, and there were all sorts of variations). This is client side though (VB.NET to be exact), and it uses the Levenshtein distance algorithm (modified for better results):
    ' Needs Imports System.Reflection, System.Threading and System.Windows.Forms.
    ' WriteDebugMessage is a logging helper that is not shown here.
    Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
        Dim bestMatch As String = ""
        Dim bestDistance As Integer = 1000 'Almost anything should be better than that!
        For Each matchCandidate As String In stringList
            Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
            If candidateDistance < bestDistance Then
                bestMatch = matchCandidate
                bestDistance = candidateDistance
            End If
        Next
        Return bestMatch
    End Function

    'This is used to determine how similar the strings are. Modified from the link below...
    'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentnews.newsdetails&newsid=37030&from=list
    Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
        Dim sLength As Integer = s.Length ' length of s
        Dim tLength As Integer = t.Length ' length of t
        Dim lvCost As Integer ' cost
        Dim lvDistance As Integer = 0
        Dim zeroCostCount As Integer = 0
        Try
            ' Step 1
            If tLength = 0 Then
                Return sLength
            ElseIf sLength = 0 Then
                Return tLength
            End If
            Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
            Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}
            ' fill first row
            For lvIndex As Integer = 0 To sLength
                poBuffer(lvIndex) = lvIndex
            Next
            ' fill first column
            For lvIndex As Integer = 1 To tLength
                poBuffer(lvIndex * (sLength + 1)) = lvIndex
            Next
            For lvRowIndex As Integer = 0 To sLength - 1
                Dim s_i As Char = s(lvRowIndex)
                For lvColIndex As Integer = 0 To tLength - 1
                    If s_i = t(lvColIndex) Then
                        lvCost = 0
                        zeroCostCount += 1
                    Else
                        lvCost = 1
                    End If
                    ' Step 6
                    Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                    Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                    Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                    Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                    lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                    poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
                Next
            Next
        Catch ex As ThreadAbortException
            Err.Clear()
        Catch ex As Exception
            WriteDebugMessage(Application.StartupPath, [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
        End Try
        ' The modification: subtract the number of matching character comparisons.
        Return lvDistance - zeroCostCount
    End Function
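For comparison, here is a minimal Python sketch of the textbook Levenshtein distance together with a best-match lookup. It is not a port of the modified version above (which subtracts the count of matching character comparisons from the result). The names `levenshtein` and `find_most_similar`, the case-insensitive comparison, and the optional `max_distance` cutoff are assumptions added for illustration.

```python
def levenshtein(s, t):
    """Textbook edit distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # iterate over the longer string, keep rows short
    previous = list(range(len(t) + 1))  # distances from "" to t[:j]
    for i, s_char in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def find_most_similar(to_find, candidates, max_distance=None):
    """Return the candidate closest to to_find (ignoring case).

    Returns None when there are no candidates, or when max_distance
    is given and even the best candidate is farther away than that.
    """
    best, best_distance = None, None
    for candidate in candidates:
        distance = levenshtein(to_find.lower(), candidate.lower())
        if best_distance is None or distance < best_distance:
            best, best_distance = candidate, distance
    if max_distance is not None and best_distance is not None:
        if best_distance > max_distance:
            return None
    return best


if __name__ == "__main__":
    print(levenshtein("ed smith", "edward smith"))
    print(find_most_similar("acme inc", ["Acme LLC", "Apex Industries"]))
```

A cutoff such as `max_distance` is worth considering for record matching, because without one the closest candidate is always returned even when nothing in the table is a plausible match.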