Open Source OCR That Makes Searchable PDFs

Posted by timothy on Thursday July 22, 2010 @03:21PM from the word-of-advice dept.

An anonymous reader writes "In my job all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text-searchable PDFs, but most are very expensive and I couldn't find any that were open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage, but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who is in the same situation."
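
For readers unfamiliar with the watched-folder pattern described in the summary, here is a minimal sketch in Python. It is not WatchOCR's actual code, which the summary does not show. The function names (`process_pending`, `watch`, `copy_as_placeholder`) and the polling interval are assumptions made for illustration, and the OCR step is a pluggable placeholder that only copies the file; a real setup would call an OCR tool chain such as CuneiForm and ExactImage at that point.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def copy_as_placeholder(src: Path, dst: Path) -> None:
    """Hypothetical stand-in for the OCR step: it only copies the file."""
    shutil.copyfile(src, dst)


def process_pending(
    in_dir: Path,
    out_dir: Path,
    ocr: Callable[[Path, Path], None] = copy_as_placeholder,
) -> List[Path]:
    """Run `ocr` on every PDF in in_dir and return the files written to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        ocr(src, dst)   # produce the (searchable) output file
        src.unlink()    # remove the input so it is not processed twice
        done.append(dst)
    return done


def watch(in_dir: Path, out_dir: Path, interval: float = 5.0) -> None:
    """Poll the watched folder forever, processing whatever has arrived."""
    while True:
        process_pending(in_dir, out_dir)
        time.sleep(interval)
```

A scanner may still be writing a file when the poll runs, so a production version would need some way to detect that a file is complete (for example, waiting until its size stops changing) before processing it.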

This discussion has been archived. No new comments can be posted.

Thanks! (Score:5, Insightful)

by Fast Thick Pants ( 1081517 ) writes: <fastthickpants@gmail . c om> on Thursday July 22, 2010 @03:23PM (#32994454)

Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!

Run on a VM (Score:3, Insightful)

by ChuckDriver ( 1276092 ) writes: on Thursday July 22, 2010 @03:32PM (#32994568)

Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.

Anyone got error rates? (Score:4, Insightful)

by savanik ( 1090193 ) writes: on Thursday July 22, 2010 @03:39PM (#32994680)

I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.
It looks really interesting, but how accurate is it? I've got some old books that are falling apart that I'd like to scan in and textify, but I'd like to know ahead of time how much time I'm going to have to budget for fixing problems and proofing.

Re:Wait a sec (Score:4, Insightful)

by ushering05401 ( 1086795 ) writes: on Thursday July 22, 2010 @03:51PM (#32994870) Journal

Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is the submitter an early adopter (v0.2) or a social engineer?

Stupid (Score:2, Insightful)

by Archangel Michael ( 180766 ) writes: on Thursday July 22, 2010 @04:04PM (#32995098) Journal

Most, if not ALL, of the documents being scanned into PDF format are generated on computers already, so why go through the whole OCR process instead of getting the actual document from the original source in a PDF version that is already text searchable?
THIS is exactly the problem with document management and processing today! We do things the hard way because we can't be bothered to change processes, even though doing so would save tons of money and be more effective and accurate.
I know people who type a document in WORD, print it to the copier/scanner/fax device, go pick up the document, put it on the document scanner, scan it to email (PDF), and send it that way.
SERIOUSLY???

Re:Thanks! (Score:3, Insightful)

by tsstahl ( 812393 ) writes: on Thursday July 22, 2010 @04:41PM (#32995612)

Virtual machine?

Re:Stupid ... maybe (Score:1, Insightful)

by Anonymous Coward writes: on Thursday July 22, 2010 @05:00PM (#32995946)

Not everyone wanting to do this does in fact have access to the electronic source. I know I would like to try it for some of my old crumbling books, as someone else mentioned above, which are no longer in print (or otherwise only available in DRM-encumbered ebook formats that I cannot read on Linux or Windows Mobile).
RO

Re:Thanks! (Score:4, Insightful)

by TooMuchToDo ( 882796 ) writes: on Thursday July 22, 2010 @05:43PM (#32996564)

Looks like Slashdot needs a moderation "+1 Thank You!" option.

Re:exactimage + cuneiform (Score:3, Insightful)

by kilf ( 135983 ) writes: on Friday July 23, 2010 @07:41AM (#33001458) Homepage

I'd love to see your script, if you want to make it available.

Re:Why on server? (Score:1, Insightful)

by Anonymous Coward writes: on Friday July 23, 2010 @10:55AM (#33003116)

If you are running a high speed scanner that scans 100ppm/200ipm, the computer would not be able to OCR the pages fast enough to keep up with the scanner throughput. Since you are paying good money for that scanner (and the operator running it), you want to get every possible image through that scanner per day. The OCR can be done after the fact on a server that only needs to be periodically monitored by IT.
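
To put rough numbers on the argument above, here is a small back-of-the-envelope sketch in Python. The 100 ppm scanner speed comes from the comment; the OCR rate of 20 pages per minute and the one-hour scanning session are assumptions chosen purely for illustration, not measurements of any OCR engine.

```python
def backlog_pages(scan_ppm: float, ocr_ppm: float, minutes: float) -> float:
    """Pages still waiting for OCR after scanning continuously for `minutes`."""
    return max(0.0, (scan_ppm - ocr_ppm) * minutes)


def catch_up_minutes(backlog: float, ocr_ppm: float) -> float:
    """Time the OCR step needs to clear the backlog once scanning stops."""
    return backlog / ocr_ppm


# 100 ppm scanner (from the comment), assumed 20 ppm OCR, one hour of scanning.
pending = backlog_pages(scan_ppm=100, ocr_ppm=20, minutes=60)
print(pending)                        # 4800 pages waiting
print(catch_up_minutes(pending, 20))  # 240.0 minutes to catch up
```

Under these assumed rates, OCR done inline would hold the scanner to a fifth of its rated speed, while a server can work through the queue afterwards without tying up the scanner or its operator.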
