SQL Server - Data Comparison


We have a SQL Server table containing company name, address, and contact name (among others).

We regularly receive data files from outside sources that we are required to match against this table. Unfortunately, the data differs since it comes from a different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example: we have "Acme, LLC" and the file contains "Acme Inc.". Another is, we have "Ed Smith" and they have "Edward Smith".

We have a legacy system that uses some rather intricate and CPU-intensive methods for handling these matches. Some involve pure SQL and others involve VBA code in an Access database. The current system is not perfect, and it is cumbersome and difficult to maintain.

Management here wants to expand its use. The developers who will inherit support of the system want to replace it with a more agile solution that requires less maintenance.

Is there a commonly accepted way of dealing with this kind of data matching?

Here's something I wrote for an identical stack (we needed to standardize manufacturer names for hardware, and there were all sorts of variations). This is client side though (VB.NET to be exact), and it uses the Levenshtein distance algorithm (modified for better results):

    ' Needs at the top of the file: Imports System.Reflection, Imports System.Threading,
    ' and Imports System.Windows.Forms (for Application.StartupPath).
    Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
        Dim bestMatch As String = ""
        Dim bestDistance As Integer = 1000 'Almost anything should be better than that!

        For Each matchCandidate As String In stringList
            Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
            If candidateDistance < bestDistance Then
                bestMatch = matchCandidate
                bestDistance = candidateDistance
            End If
        Next

        Return bestMatch
    End Function

    'This will be used to determine how similar strings are.  Modified from the link below...
    'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentnews.newsdetails&newsid=37030&from=list
    Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
        Dim sLength As Integer = s.Length ' length of s
        Dim tLength As Integer = t.Length ' length of t
        Dim lvCost As Integer ' cost
        Dim lvDistance As Integer = 0
        Dim zeroCostCount As Integer = 0 ' number of character pairs that compared equal

        Try
            ' Step 1
            If tLength = 0 Then
                Return sLength
            ElseIf sLength = 0 Then
                Return tLength
            End If

            ' The (sLength + 1) x (tLength + 1) matrix is stored in a flat array.
            Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
            Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}

            ' fill first row
            For lvIndex As Integer = 0 To sLength
                poBuffer(lvIndex) = lvIndex
            Next

            'fill first column
            For lvIndex As Integer = 1 To tLength
                poBuffer(lvIndex * (sLength + 1)) = lvIndex
            Next

            For lvRowIndex As Integer = 0 To sLength - 1
                Dim s_i As Char = s(lvRowIndex)
                For lvColIndex As Integer = 0 To tLength - 1
                    If s_i = t(lvColIndex) Then
                        lvCost = 0
                        zeroCostCount += 1
                    Else
                        lvCost = 1
                    End If

                    ' Step 6
                    Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                    Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                    Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                    Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                    lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                    poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
                Next
            Next
        Catch ex As ThreadAbortException
            Err.Clear()
        Catch ex As Exception
            ' WriteDebugMessage is a logging helper defined elsewhere (not shown here).
            WriteDebugMessage(Application.StartupPath, [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
        End Try

        ' The modification: subtract the count of equal character pairs from the plain distance.
        Return lvDistance - zeroCostCount
    End Function
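For readers who want to try the idea outside VB.NET, here is a minimal Python sketch of the same approach: compute an edit distance from the incoming value to every known value and keep the closest one. It uses the plain, unmodified Levenshtein distance, so it does not reproduce the zero-cost adjustment in the code above. The function names `levenshtein_distance` and `find_most_similar_string` are illustrative choices, not part of any library.

```python
from typing import Iterable


def levenshtein_distance(s: str, t: str) -> int:
    """Plain Levenshtein distance: insert, delete and substitute each cost 1."""
    if not s:
        return len(t)
    if not t:
        return len(s)

    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def find_most_similar_string(to_find: str, candidates: Iterable[str]) -> str:
    """Return the candidate with the smallest distance ('' if there are none).

    Ties go to the first candidate seen, as in the VB.NET loop above.
    """
    best_match = ""
    best_distance = None
    for candidate in candidates:
        distance = levenshtein_distance(to_find, candidate)
        if best_distance is None or distance < best_distance:
            best_match = candidate
            best_distance = distance
    return best_match


if __name__ == "__main__":
    known = ["Hewlett-Packard", "Dell", "Lenovo"]
    print(find_most_similar_string("Hewlett Packard", known))  # Hewlett-Packard
```

Because this sketch compares raw strings, case and punctuation differences count as edits. Whether to normalize the input first is a separate design decision that the sketch leaves open.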
