
Open Source OCR That Makes Searchable PDFs

An anonymous reader writes "In my job, all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text-searchable. I looked around for software that would create text-searchable PDFs, but most of it is very expensive, and I couldn't find any that was open source (free). I did find some open source packages, like CuneiForm and ExactImage, that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder; WatchOCR picks up those files, OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage, but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else in the same situation."
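
For readers wondering what a "watched folder" service amounts to, here is a minimal Python sketch of the general pattern. It is not WatchOCR's actual code: the function names and the `ocr` hook are hypothetical, and the hook stands in for whatever OCR step you wire up behind it.

```python
import time
from pathlib import Path
from typing import Callable, List


def process_once(inbox: Path, outbox: Path,
                 ocr: Callable[[Path, Path], None]) -> List[Path]:
    """Run the OCR hook on every *.pdf currently in inbox.

    `ocr(src, dst)` is a hypothetical hook (an assumption of this sketch):
    it is expected to read the scanned PDF at src and write a searchable
    PDF to dst.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        ocr(src, dst)
        src.unlink()  # remove the input so it is not picked up again
        done.append(dst)
    return done


def watch(inbox: Path, outbox: Path,
          ocr: Callable[[Path, Path], None], interval: float = 5.0) -> None:
    """Poll the inbox forever, processing whatever has arrived."""
    while True:
        process_once(inbox, outbox, ocr)
        time.sleep(interval)
```

A real service would also need to skip files the scanner is still writing (for example, by waiting until a file's size stops changing); the sketch leaves that out.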

  • Re:Thanks! (Score:3, Informative)

    by godrik ( 1287354 ) on Thursday July 22, 2010 @03:32PM (#32994580)

    Same here. Thank you too!

    (I know this post is very redundant and useless. But thanks are always welcome, aren't they?)

  • Re:commercial? (Score:2, Informative)

    by Anonymous Coward on Thursday July 22, 2010 @03:44PM (#32994752)

    After doing a similar search recently, your two major choices are ABBYY FineReader (they have Enterprise/Server-level editions) or OmniReader (again at the Server/Enterprise level). They're priced pretty closely and have pretty well-matched features, plus high accuracy. We're in the process of moving from a solution originally based on Adobe Acrobat's built-in OCR, which is okay but not great. Initial testing with ABBYY showed a demonstrably lower error rate on documents from scanned-in legal files.

  • Re:commercial? (Score:4, Informative)

    by ganjadude ( 952775 ) on Thursday July 22, 2010 @03:46PM (#32994786) Homepage
    There is! I happen to work for a company (shameless plug) called DocuWare. It's document management software that does all of that. We are not in 24/7; we are in 8 AM-8 PM Eastern, M-F, for support (I am the support) at the corporate level. However, we sell through a dealer network that provides support on a contract basis (many Toshiba business solutions are resellers for us, and I know they are 24x7).
  • by petermgreen ( 876956 ) <plugwash @ p> on Thursday July 22, 2010 @05:15PM (#32996182) Homepage

    AFAICT the original structure was already gone when the PDF was made; you can only try to reverse-engineer it from the drawing objects.

    You might want to try converting to PostScript using Ghostscript and then converting to SVG using pstoedit. You still won't have the original structure, but at least you should have the table shape as a vector drawing rather than a bitmap.
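
    The two-step conversion suggested above could be scripted roughly like this. It is a sketch under assumptions: it expects `gs` and `pstoedit` to be on the PATH, and the `plot-svg` output format is only available if your pstoedit build includes that driver, so treat the `fmt` default as a guess to check against your installation. The function names are made up for illustration.

```python
import subprocess
from typing import List


def gs_pdf_to_ps_cmd(pdf: str, ps: str) -> List[str]:
    """Build a Ghostscript command that converts a PDF to PostScript."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=ps2write", f"-sOutputFile={ps}", pdf]


def pstoedit_cmd(ps: str, svg: str, fmt: str = "plot-svg") -> List[str]:
    """Build a pstoedit command; fmt is an assumption about your build."""
    return ["pstoedit", "-f", fmt, ps, svg]


def pdf_to_svg(pdf: str, ps: str, svg: str, fmt: str = "plot-svg") -> None:
    """Run both steps, raising CalledProcessError if either tool fails."""
    subprocess.run(gs_pdf_to_ps_cmd(pdf, ps), check=True)
    subprocess.run(pstoedit_cmd(ps, svg, fmt), check=True)
```

    Something like `pdf_to_svg("table.pdf", "table.ps", "table.svg")` would then run the whole chain, with the file names here being placeholders.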

  • Tesseract OCR (Score:3, Informative)

    by TheSync ( 5291 ) on Thursday July 22, 2010 @05:35PM (#32996454) Journal

    I found Tesseract to work very well for OCR tasks. It doesn't generate PDF, though.

  • by It's the tripnaut! ( 687402 ) on Thursday July 22, 2010 @07:18PM (#32997722) Homepage

    While we are on the topic, has anyone seen a good solution to scan, OCR, and reconvert existing crappy PDFs to improve them?

    I've tried quite a few free and proprietary OCR packages, and the best available right now, IMHO, is ABBYY FineReader. Besides fonts, it also easily recognizes tables, diagrams, and illustrations. But most of all, it can read and render 189 languages (including Chinese and Cyrillic) accurately. A free trial version is available.
