Open Source OCR That Makes Searchable PDFs 133

Posted by timothy on Thursday July 22, 2010 @03:21PM from the word-of-advice dept.

An anonymous reader writes "In my job all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text-searchable PDFs, but most packages are very expensive and I couldn't find any that were open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage, but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who has this same situation."
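
The watched-folder workflow described in the summary can be sketched roughly as follows. This is a minimal illustration, not how WatchOCR itself is implemented: the function names `process_watched_folder` and `copy_only` are hypothetical, and the OCR step is a stand-in that the caller would replace with a real engine.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_watched_folder(inbox: Path, outbox: Path,
                           ocr: Callable[[Path, Path], None]) -> List[Path]:
    """Run one pass over `inbox`, writing one result per PDF into `outbox`."""
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        ocr(src, dst)   # hypothetical OCR step supplied by the caller
        src.unlink()    # remove the original so it is not picked up again
        done.append(dst)
    return done


def copy_only(src: Path, dst: Path) -> None:
    """Stand-in for a real OCR engine: copies the file unchanged."""
    shutil.copyfile(src, dst)
```

A real service would call a function like this on a timer (or react to filesystem events) and would also need to guard against picking up files the scanner is still writing.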

This discussion has been archived. No new comments can be posted.

Re:Thanks! (Score:3, Informative)

by godrik ( 1287354 ) writes: on Thursday July 22, 2010 @03:32PM (#32994580)

Same here. Thank you too!
(I know this post is very redundant and useless. But thanks are always welcome, aren't they?)

Re:commercial? (Score:2, Informative)

by Anonymous Coward writes: on Thursday July 22, 2010 @03:44PM (#32994752)

After doing a similar search recently, I found that your two major choices are ABBYY FineReader (they have Enterprise/Server level editions) or OmniReader (again at the Server/Enterprise level). They're priced pretty closely and have pretty well matched features, plus high accuracy. We're in the process of moving from a solution originally based on Adobe Acrobat's built-in OCR, which is okay but not great. Initial testing with ABBYY showed a demonstrably lower error rate on documents from scanned-in legal files.

Re:commercial? (Score:4, Informative)

by ganjadude ( 952775 ) writes: on Thursday July 22, 2010 @03:46PM (#32994786) Homepage

There is! I happen to work for a company (shameless plug) called DocuWare. It's document management software that does all of that. We are not in 24/7; at the corporate level, support is available 8 AM-8 PM Eastern, Monday through Friday (I am the support). However, we sell through a dealer network that provides support on a contract basis (many Toshiba Business Solutions dealers are resellers for us, and I know they are 24x7). www.docuware.com

Re:better alternatives to pdftohtml (Score:3, Informative)

by petermgreen ( 876956 ) writes: <plugwashNO@SPAMp10link.net> on Thursday July 22, 2010 @05:15PM (#32996182) Homepage

AFAICT the original structure was already gone when the PDF was made; you can only try to reverse engineer it from the drawing objects.
You might want to try converting to PostScript using Ghostscript and then converting to SVG using pstoedit. You still won't have the original structure, but at least you should have the table shape as a vector drawing rather than a bitmap.
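
A rough sketch of that two-step conversion, driven from Python. The helper names are hypothetical, and the exact options are assumptions to check against your installed versions: `ps2write` is the PostScript output device in current Ghostscript releases (older ones used a different device name), and the `plot-svg` output format is only present in pstoedit builds that include libplot support.

```python
import subprocess
from typing import List, Tuple


def build_commands(pdf_path: str, ps_path: str,
                   svg_path: str) -> Tuple[List[str], List[str]]:
    """Return the Ghostscript and pstoedit command lines without running them."""
    # Step 1: PDF -> PostScript (device name assumed; see note above).
    to_ps = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=ps2write",
             f"-sOutputFile={ps_path}", pdf_path]
    # Step 2: PostScript -> SVG (format name depends on how pstoedit was built).
    to_svg = ["pstoedit", "-f", "plot-svg", ps_path, svg_path]
    return to_ps, to_svg


def convert(pdf_path: str, ps_path: str, svg_path: str) -> None:
    """Run both steps in order, raising CalledProcessError if either tool fails."""
    for cmd in build_commands(pdf_path, ps_path, svg_path):
        subprocess.run(cmd, check=True)
```

Running `pstoedit -help` lists the output formats your build actually supports, which is worth checking before relying on the format name used here.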

Tesseract OCR (Score:3, Informative)

by TheSync ( 5291 ) writes: on Thursday July 22, 2010 @05:35PM (#32996454) Homepage Journal

I found tesseract [google.com] to work very well to do OCR tasks. Doesn't generate PDF though.
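
For reference, a minimal sketch of calling the Tesseract command line from Python. The basic form `tesseract <image> <outputbase> -l <lang>` writes the recognized text to `<outputbase>.txt`, which matches the point above that the result is plain text rather than a PDF. The helper names and the default language code are assumptions for illustration.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, output_base: str,
                            lang: str = "eng") -> List[str]:
    """Build the command line; Tesseract appends .txt to output_base itself."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_to_text(image_path: str, output_base: str, lang: str = "eng") -> str:
    """Run Tesseract and return the recognized text from the file it wrote."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```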

Re:Thanks for the info... (Score:2, Informative)

by It's the tripnaut! ( 687402 ) writes: on Thursday July 22, 2010 @07:18PM (#32997722) Homepage

While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?
I've tried quite a few free and proprietary OCR packages, and the best available right now, imho, is ABBYY FineReader [abbyy.com]. Besides fonts, it also easily recognizes tables, diagrams and illustrations. But most of all, it can read and render 189 languages (including Chinese and Cyrillic) accurately. A free trial version is available.
