PDF Image -> PDF Txt?

Posted by: Tim

PDF Image -> PDF Txt? - 25/05/2007 12:40

We got some new printers here with the capability to scan at the printer and save the PDF it creates on your computer back at your desk. The only issue with this is it takes what is scanned, converts it to an image and then saves that image in a PDF.

I have a TON of old reports (from the mid 70s on) that I would like to scan. My question is, is there any way to take an image in a PDF and convert it into text? I found several PDF -> Word/PowerPoint/Excel converters, but they just rip the image out and paste it into the other application. I'm looking to convert that picture back into the text, using some kind of OCR I guess.

Anybody have any hints/tips on the easiest/cheapest way to do this?

Thanks!
Posted by: Schido

Re: PDF Image -> PDF Txt? - 25/05/2007 13:10

SimpleOCR?

http://www.simpleocr.com/

It only opens bmp, jpg, and tiff (and won't handle lzw tiffs it seems), and can only save as doc or txt.
But hey, it's free. (For non-commercial use, I noticed just now.)
Posted by: tfabris

Re: PDF Image -> PDF Txt? - 25/05/2007 15:18

Quote:
The only issue with this is it takes what is scanned, converts it to an image and then saves that image in a PDF.

I'm probably being pedantic here, but just to be clear: All scanners can ever do is make images. It was always an image, so it was never "converted to an image".

The thing you're asking to do is, as you already know, OCR. Which is pre-built into many scanner software packages already. You said this was a new scanner/printer, so I'm surprised it's not already doing that for you. I'd look more carefully at the documentation and the bundled software that came with the printer. It's probably just a question of installing the right disc, clicking on the right icon, or setting the right configuration check box.
Posted by: tanstaafl.

Re: PDF Image -> PDF Txt? - 25/05/2007 21:43

Quote:
The thing you're asking to do is, as you already know, OCR.


Admittedly it's been 10 years or more since I played with OCR, but has the software/hardware improved enough to be actually useful?

When I was trying it, they were bragging about 99% accuracy. That's all well and good, but on a full text page that amounts to about 40 errors I'd have to find and correct. It was a tossup whether it was easier to just type the document myself or scan it in and then spend 10 minutes fixing it. Add in the fact that it would try and divide the page up into frames based on the layout of the original document (all I wanted was just the plain text, conveniently paragraphed) and it just wasn't worth the trouble.
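The arithmetic behind that complaint can be sketched as below. The 4000 characters per page is an assumption, chosen only because it is the figure implied by "99% accuracy" giving "about 40 errors"; real pages vary, and the function name is made up for illustration.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of misread characters on one page."""
    return chars_per_page * (1.0 - accuracy)


# Assumption: a dense page of text holds roughly 4000 characters.
CHARS_PER_PAGE = 4000

# Compare a few per-character accuracy levels.
for accuracy in (0.99, 0.999, 0.9999):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.2%} accurate: about {errors:.1f} errors per page")
```

Under that assumption, each extra "9" in the accuracy figure cuts the cleanup work by a factor of ten, which is why a per-character percentage says little until it is multiplied by page length.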

I take it things are better now?

tanstaafl.
Posted by: msaeger

Re: PDF Image -> PDF Txt? - 26/05/2007 01:14

I haven't used it too much but I have been impressed the few times I have lately. I think it has improved.
Posted by: Tim

Re: PDF Image -> PDF Txt? - 26/05/2007 14:57

When I asked our IT department about having it as words instead of images, they said the printers don't support that. Not sure if they really know what they are talking about, but the configuration of the printers is locked down and a password is required, so I can't go in and dork around to see if it is possible.

I do know there is nothing available in the software package that came with the printer. I think the fact that the computers and printer are locked down will make it almost impossible to get what I'm looking for.
Posted by: lectric

Re: PDF Image -> PDF Txt? - 26/05/2007 15:52

The printers typically WON'T support what you are asking for. There are plenty of software packages that can OCR images after the fact, such as OmniPage.
Posted by: tfabris

Re: PDF Image -> PDF Txt? - 26/05/2007 20:04

Quote:
they said the printers don't support that.

I don't know of any printers with built-in OCR. I meant that scanner/printers frequently come bundled with third-party OCR software. Since your IT people aren't handing you those disks, I suppose you're stuck trying to find your own OCR solution. One was linked earlier in this thread, and I'm sure there are gajillions of them.
Posted by: altman

Re: PDF Image -> PDF Txt? - 27/05/2007 06:22

Acrobat (full version) does OCR on "scanned" PDFs, making a PDF with the text "behind" the scanned image - hence it still looks exactly like the original doc, but you can search. Acrobat also has multi-file search so you can search a whole hierarchy.
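For anyone without Acrobat, the same "text behind the image" idea can be sketched with a command-line OCR tool. This is only an illustration and not something mentioned in this thread: it assumes the open-source `ocrmypdf` program is installed, and the helper names are invented for the example.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(scanned_pdf: Path, searchable_pdf: Path,
                      language: str = "eng") -> list[str]:
    """Build an ocrmypdf command that adds a hidden text layer to a scanned PDF."""
    return [
        "ocrmypdf",
        "-l", language,   # language(s) the OCR engine should expect
        "--skip-text",    # leave pages that already contain text untouched
        str(scanned_pdf),
        str(searchable_pdf),
    ]


def make_searchable(scanned_pdf: Path, searchable_pdf: Path) -> bool:
    """Run the OCR step if ocrmypdf is available; return True on success."""
    if shutil.which("ocrmypdf") is None:
        return False  # assumption not met: the tool is not installed
    result = subprocess.run(build_ocr_command(scanned_pdf, searchable_pdf))
    return result.returncode == 0
```

Calling `make_searchable(Path("scan.pdf"), Path("scan_ocr.pdf"))` (hypothetical file names) would leave the page images as they are and write a copy whose text can be searched and selected, much like the Acrobat result described above.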

I've got one of the Fujitsu ScanSnap multi-page scanners that:

a) Come with the full version of Acrobat
b) Come with an OCR program that will scan, OCR, and save multipage (and duplex) documents with a single press on the scanner's scan button.
c) Really are very good value. They even do colour.

See http://empegbbs.com/ubbthreads/showflat.php?Cat=0&Board=offtopic&Number=249655

Hugo