OCR - Programmatically recognize text from scans in a PDF file


I have a PDF file that contains data I need to import into a database. The files seem to be PDF scans of printed alphanumeric text. It looks like 10 pt. Times New Roman.

Are there any tools or components that can allow me to recognize and parse this text?

I've used pdftohtml to strip tables out of a PDF into CSV. It's based on Xpdf, which is a more general-purpose tool that includes pdftotext. I just wrap it in a Process.Start call from C#.
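The Process.Start wrapper mentioned above might look roughly like the following. This is a minimal sketch, not the code the answer's author used: it assumes the pdftotext executable is on the PATH, and the class and method names (PdfToTextRunner, ExtractText) are made up for illustration. The `-layout` flag and the `-` output argument (write to standard output) are standard pdftotext options.

```csharp
using System;
using System.Diagnostics;

static class PdfToTextRunner
{
    // Runs pdftotext on the given PDF and returns whatever text it extracts.
    // Hypothetical helper: assumes "pdftotext" can be found on the PATH.
    public static string ExtractText(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",
            // -layout keeps the physical layout; "-" sends the text to stdout.
            Arguments = "-layout \"" + pdfPath + "\" -",
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read the output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);

            return text;
        }
    }

    static void Main(string[] args)
    {
        Console.Write(ExtractText(args[0]));
    }
}
```

If the PDF contains only scanned page images, expect this to return little or no text, because pdftotext reads the PDF's text layer and does no OCR.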

If you're looking for something a little more DIY, there's the iTextSharp library (a port of Java's iText) and PDFBox (yes, it says Java, but they have a .NET version by way of IKVM.NET). There are CodeProject articles on using iTextSharp and PDFBox from C#.

And, if you're really a masochist, you could call Adobe's PDF IFilter with COM interop. The IFilter specs are pretty simple, but I would guess that the interop overhead would be significant.

Edit: After re-reading the question and subsequent answers, it's become clear that the OP is dealing with images in the PDF. In that case, you'll need to extract the images (the PDF libraries above should be able to do that fairly easily) and run them through an OCR engine.

I've used MODI (Microsoft Office Document Imaging) interactively before, with decent results. It's COM, so calling it from C# via interop is also doable and pretty simple:

    ' lifted from http://en.wikipedia.org/wiki/microsoft_office_document_imaging
    Imports System.IO   ' needed for File.AppendAllText

    Dim inputFile As String = "c:\test\multipage.tif"
    Dim strRecText As String = ""
    Dim Doc1 As MODI.Document

    Doc1 = New MODI.Document
    Doc1.Create(inputFile)
    Doc1.OCR()  ' OCR all pages of a multi-page TIFF file
    Doc1.Save() ' save the deskewed, reoriented images and the OCR text back to inputFile

    For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
        strRecText &= Doc1.Images(imageCounter).Layout.Text    ' append this page's OCR results to the string
    Next

    File.AppendAllText("c:\test\testmodi.txt", strRecText)     ' write the OCR text out to disk

    Doc1.Close() ' clean up
    Doc1 = Nothing

Others have recommended Tesseract, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it depends greatly on the source quality.
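For anyone who wants to try Tesseract from C#, one option is to shell out to its command-line tool. The sketch below is an untested assumption rather than a recommendation from experience: it assumes the tesseract executable is on the PATH and that the page has already been extracted from the PDF as an image file. The class and method names are hypothetical. It relies on Tesseract's basic `tesseract <image> <outputbase>` invocation, which writes the recognized text to `<outputbase>.txt`.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractRunner
{
    // Runs the tesseract CLI on one image and returns the recognized text.
    // Hypothetical helper: assumes "tesseract" can be found on the PATH.
    public static string Recognize(string imagePath)
    {
        // Tesseract appends ".txt" to the output base name it is given.
        string outputBase = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        string textFile = outputBase + ".txt";

        var startInfo = new ProcessStartInfo
        {
            FileName = "tesseract",
            Arguments = "\"" + imagePath + "\" \"" + outputBase + "\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };

        try
        {
            using (var process = Process.Start(startInfo))
            {
                process.WaitForExit();
                if (process.ExitCode != 0)
                    throw new InvalidOperationException(
                        "tesseract exited with code " + process.ExitCode);
            }

            return File.ReadAllText(textFile);
        }
        finally
        {
            // Remove the temporary output; File.Delete ignores a missing file.
            File.Delete(textFile);
        }
    }

    static void Main(string[] args)
    {
        Console.Write(Recognize(args[0]));
    }
}
```

Recognition quality will still depend on the scan, so it is worth checking the output against a few known pages before importing anything into the database.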

