OCR - Programmatically recognize text from scans in a PDF file
I have a PDF file that contains data I need to import into a database. The files appear to be PDF scans of printed alphanumeric text. The text looks like 10 pt. Times New Roman.
Are there any tools or components that would allow me to recognize and parse this text?
I've used pdftohtml to successfully strip tables out of PDFs into CSV. It's based on Xpdf, a more general-purpose tool that also includes pdftotext. I just wrap it in a Process.Start call from C#.
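A minimal sketch of what that wrapper might look like; the executable path and file names here are assumptions, not part of the original answer:

```csharp
using System.Diagnostics;

class PdfTextExtractor
{
    static void Main()
    {
        var psi = new ProcessStartInfo
        {
            // Hypothetical install path for the Xpdf tools
            FileName = @"C:\tools\xpdf\pdftotext.exe",
            // -layout tries to preserve the original physical layout,
            // which helps when stripping tabular data
            Arguments = "-layout \"input.pdf\" \"output.txt\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var proc = Process.Start(psi))
        {
            proc.WaitForExit(); // block until pdftotext finishes writing output.txt
        }
    }
}
```

From there you can read `output.txt` and parse it however your import process requires.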
If you're looking for something a little more DIY, there's the iTextSharp library - a port of Java's iText - and PDFBox (yes, it says Java, but there's a .NET version by way of IKVM.NET). There are CodeProject articles on using both iTextSharp and PDFBox from C#.
And, if you're a real masochist, you could call Adobe's PDF IFilter via COM interop. The IFilter specs are pretty simple, but I'd guess the interop overhead would be significant.
Edit: After re-reading the question and the subsequent answers, it's become clear that the OP is dealing with images in the PDF. In that case, you'll need to extract the images (the PDF libraries above are able to do that easily) and run them through an OCR engine.
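As a rough sketch of the image-extraction step using iTextSharp (the file names are placeholders, and this dumps the raw stream bytes without decoding them, which works directly only for formats like JPEG):

```csharp
using System.IO;
using iTextSharp.text.pdf;

class PdfImageExtractor
{
    static void Main()
    {
        PdfReader reader = new PdfReader("scanned.pdf");

        // Walk every object in the cross-reference table looking for image streams
        for (int i = 1; i < reader.XrefSize; i++)
        {
            PdfObject obj = reader.GetPdfObject(i);
            if (obj == null || !obj.IsStream())
                continue;

            PRStream stream = (PRStream)obj;
            if (PdfName.IMAGE.Equals(stream.Get(PdfName.SUBTYPE)))
            {
                // Raw (still-encoded) stream bytes; a DCTDecode stream is a JPEG
                byte[] bytes = PdfReader.GetStreamBytesRaw(stream);
                File.WriteAllBytes("image" + i + ".bin", bytes);
            }
        }

        reader.Close();
    }
}
```

Each extracted image can then be fed to whichever OCR engine you choose.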
I've used MODI interactively before with decent results. It's COM, so calling it from C# via interop is doable and pretty simple:
    ' Lifted from http://en.wikipedia.org/wiki/Microsoft_office_document_imaging
    Dim inputFile As String = "C:\test\multipage.tif"
    Dim strRecText As String = ""
    Dim doc1 As MODI.Document
    doc1 = New MODI.Document
    doc1.Create(inputFile)
    doc1.OCR() ' OCR all pages of a multi-page TIFF file
    doc1.Save() ' save the deskewed, reoriented images, and the OCR text, back to inputFile
    For imageCounter As Integer = 0 To (doc1.Images.Count - 1) ' work your way through each page of results
        strRecText &= doc1.Images(imageCounter).Layout.Text ' puts the OCR results into a string
    Next
    File.AppendAllText("C:\test\testmodi.txt", strRecText) ' write the OCR file out to disk
    doc1.Close() ' clean up
    doc1 = Nothing
Others have mentioned Tesseract, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it largely depends on your source quality.