Unoffical empeg BBS

#361051 - 19/02/2014 15:53 Scanning, OCR, PDF and OS X, best ways?
drakino
carpal tunnel

Registered: 08/06/1999
Posts: 7868
It's time I properly digitize my paper life. I spent some time reading about portable scanners with paper feeds. I considered the Canon they recommended, then looked into the Brother model they also mention. Now it's on my desk. Some of the newer features such as wireless scanning to iOS caught my eye.

One key aspect of the Canon and Brother units is their support for TWAIN, meaning any scanner program can talk to them and control them, not just the vendor software.

So the hardware part is done. Now the big challenge: how do I scan and organize my documents so I can find them again? OCR comes to mind, so that I can have full text search. I've started by using the vendor-included OCR software, and it generated a PDF in which the scanned image was replaced with text. But if I were to reprint the PDF, it would look nothing like the original. So I'm thinking I need some way to scan a page, keep the image in the PDF, and have the OCR text included in each page as metadata somehow. That way I can still find documents with full text search, and have the ability to reprint an exact copy if ever needed.

Anyone have any thoughts or experience with doing this on OS X?
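
One way to get the result described above (the page image kept intact, with the recognized text stored as an invisible layer for searching) is the open-source OCRmyPDF command-line tool. Nobody in this thread mentions it; it is an assumption swapped in here purely for illustration, and the file names are hypothetical. The sketch only builds the command line, so it can be inspected before anything is run.

```python
import shlex
from pathlib import Path


def build_ocr_command(scan: Path, out_dir: Path, language: str = "eng") -> list[str]:
    """Build an OCRmyPDF command that keeps the page image and adds a hidden text layer.

    Assumption: the `ocrmypdf` CLI is installed. It is not a tool named in this thread.
    """
    searchable = out_dir / scan.name
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language pack to use
        "--skip-text",           # leave pages that already contain text untouched
        str(scan),
        str(searchable),
    ]


# Hypothetical file names, for illustration only.
cmd = build_ocr_command(Path("inbox/2014-02-19 phone bill.pdf"), Path("archive"))
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Because the output is still an ordinary PDF with a text layer, a local indexer such as Spotlight should be able to search it without any cloud service involved.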

#361052 - 19/02/2014 16:19 Re: Scanning, OCR, PDF and OS X, best ways? [Re: drakino]
Dignan
carpal tunnel

Registered: 08/03/2000
Posts: 12338
Loc: Sterling, VA
Tom: I know you could predict that this would come from me, but I scan everything to PDF, then put it in Google Drive. Drive automatically does OCR, so when I search Drive for files, it can search within the content for my results.

Canon makes great scanners, as does Epson. The fastest scanners are probably the Fujitsu ScanSnap.
_________________________
Matt

#361053 - 19/02/2014 16:37 Re: Scanning, OCR, PDF and OS X, best ways? [Re: drakino]
K447
old hand

Registered: 29/05/2002
Posts: 798
Loc: near Toronto, Ontario, Canada
I am also a fan of the Fujitsu ScanSnap product.

The Fujitsu bundled OCR software (ABBYY FineReader) does what you require, retaining the original scanned image with the OCR'ed text hidden 'behind' the image. If I mouse select a word or paragraph the highlighting of the OCR text appears right on/under the original image 'text'. Sometimes the alignment of the image text and the OCR text is not perfect but overall it is usually quite good.

There is a specific setting in my Fujitsu FineReader software to make the OCR engine retain the scanned bitmap, because the default is to prioritize small output file size and discard all regions that the OCR engine was able to process.

I did find that scanning above 300 DPI seemed to improve OCR accuracy, but of course the resulting file was also larger since I wanted the original scan images included. I typically use 400 DPI unless there are graphics or imagery that I really want to have in high quality.
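
The size penalty for a higher resolution follows from the pixel count growing with the square of the DPI. A rough sketch, assuming a US Letter page (8.5 x 11 inches, not stated in the post); actual PDF sizes also depend on colour depth and compression, which this ignores.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Pixel count of one scanned page side at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


for dpi in (300, 400):
    print(dpi, "DPI:", page_pixels(dpi), "pixels per page side")

# Pixel count grows with the square of the resolution.
ratio = page_pixels(400) / page_pixels(300)
print(f"400 DPI holds {ratio:.2f}x the pixels of 300 DPI")
```

Going from 300 to 400 DPI therefore means roughly 1.78 times as much raw image data per page before compression.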



Edited by K447 (19/02/2014 16:38)

#361054 - 19/02/2014 16:41 Re: Scanning, OCR, PDF and OS X, best ways? [Re: Dignan]
K447
old hand

Registered: 29/05/2002
Posts: 798
Loc: near Toronto, Ontario, Canada
Originally Posted By: Dignan
... I scan everything to PDF, then put it in Google Drive. Drive automatically does OCR, so when I search Drive for files, it can search within the content for my results...
Does the uploaded PDF contain just the scanned page images, or does it already have the OCR text before it gets to Google?

#361055 - 19/02/2014 16:46 Re: Scanning, OCR, PDF and OS X, best ways? [Re: drakino]
tanstaafl.
carpal tunnel

Registered: 08/07/1999
Posts: 5549
Loc: Ajijic, Mexico
Originally Posted By: drakino
Anyone have any thoughts or experience with doing this on OS X?
As you know full well, I have no experience with OS X, but I do have some experience scanning with Windows that might well be applicable to one of those, uhhh, what is it, peach, pear, rutabaga, no, wait, Apple machines.

Just before coming to Mexico, I completed a project for a psychologist that involved scanning and organizing 176,512 documents (yes I kept count!). Looking at what you have said above, I have two suggestions, one specific that you can't use because it is too late, and one general that is probably not how you want to do things.

Specific: I would have chosen a different scanner (see this post - watch the video!) but you already have your scanner now. If you have a LOT of pages to scan, you will find five or six seconds a page to be frustrating. Remember, that speed they show is up to 18 pages per minute. Depending on what DPI you set and the complexity of the document, you may see considerably less than that.
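
The throughput arithmetic is easy to run for your own backlog. A minimal sketch, assuming a hypothetical backlog of 5,000 pages (that figure is not from this thread). The 18 pages per minute is the rated maximum quoted above, and six seconds per page works out to 10 pages per minute.

```python
def scan_hours(pages: int, pages_per_minute: float) -> float:
    """Hours of pure feed time, ignoring paper handling and misfeeds."""
    return pages / pages_per_minute / 60


backlog = 5000  # hypothetical page count, for illustration only

best = scan_hours(backlog, 18)      # rated maximum speed
slow = scan_hours(backlog, 60 / 6)  # six seconds per page = 10 pages per minute

print(f"At 18 ppm: {best:.1f} hours")
print(f"At 6 s/page: {slow:.1f} hours")
```

For that assumed backlog the difference is about 4.6 hours versus 8.3 hours of feed time.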

General: If you have a LOT of pages to scan, I'd give up on the idea of OCR and text search. You can use the OCR conversion built into Adobe Acrobat and get fair accuracy (99+%) but as you note, it doesn't maintain formatting that well, and also it is s l o w. Alternatively, I have had quite good results with ABBYY PDF Transformer. But any OCR conversion is going to be problematic if you are thinking in terms of being able to reprint and use the original document. Even 99% conversion accuracy will leave you with dozens of OCR errors per page.
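
The "dozens of errors per page" figure can be checked with simple arithmetic. A rough sketch, assuming accuracy is measured per character and that a dense page holds about 3,000 characters; both are assumptions, not figures from this thread.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> int:
    """Expected number of misrecognized characters on one page."""
    return round(chars_per_page * (1.0 - accuracy))


CHARS_PER_PAGE = 3000  # assumed density of a full page of text

for accuracy in (0.99, 0.999):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.1%} accurate: about {errors} errors per page")
```

Under those assumptions, 99% accuracy means about 30 wrong characters per page, which matters for reprinting but much less for keyword search.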

Instead, leave the scanned pages in their original PDF format, and the time you would otherwise have used fighting with the OCR you can spend organizing the PDF files into directories, subdirectories, and mnemonic filenames. Something like Automotive --> Motorcycle --> 2011 --> Repairs --> 2011-05-24 Jason's Superbike Shop.
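
A naming scheme like this is easy to apply consistently with a small helper. A minimal sketch: the function name and its parameters are hypothetical, and the `.pdf` extension is added for illustration. Putting the date first in ISO format (year-month-day) means files sort chronologically by name.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_path(area: list[str], kind: str, when: date, label: str) -> PurePosixPath:
    """Area folders, then the year, then the document kind, then a dated file name."""
    filename = f"{when.isoformat()} {label}.pdf"
    return PurePosixPath(*area, str(when.year), kind, filename)


path = archive_path(
    ["Automotive", "Motorcycle"],
    "Repairs",
    date(2011, 5, 24),
    "Jason's Superbike Shop",
)
print(path)
```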

If you do that, it will be easier than a text search to find the file you want, and you will have the additional advantage of having it in its original formatting. If you want to edit/change a page and print it, then you could either save the page as a graphic file (.png, .jpg) and edit with a graphics editor, or OCR it and edit it in MS-Word or whatever. But since it is unlikely you will ever need to do anything other than look at these pages on-screen, why go through the considerable extra work of doing OCR?

Now, if some of your PDF files run dozens or even hundreds of pages, finding the exact page you want with the mnemonic filename system will prove difficult, but I suspect your individual files will prove small enough that locating the exact page/paragraph/sentence will be workable.

Anyway, that's how I'd do it.

tanstaafl.
_________________________
"There Ain't No Such Thing As A Free Lunch"

#361056 - 19/02/2014 17:07 Re: Scanning, OCR, PDF and OS X, best ways? [Re: K447]
Dignan
carpal tunnel

Registered: 08/03/2000
Posts: 12338
Loc: Sterling, VA
Originally Posted By: K447
Originally Posted By: Dignan
... I scan everything to PDF, then put it in Google Drive. Drive automatically does OCR, so when I search Drive for files, it can search within the content for my results...
Does the uploaded PDF contain just the scanned page images, or does it already have the OCR text before it gets to Google?

It does not need to have the OCR done before uploading. This is handy if your scanner doesn't come with OCR software, as ABBYY is somewhat pricey. But I don't think Drive attaches the OCR text to the file. I think it only stores that data with Google. That's fine for me but might not be suitable for others.
_________________________
Matt

#361057 - 19/02/2014 17:49 Re: Scanning, OCR, PDF and OS X, best ways? [Re: tanstaafl.]
drakino
carpal tunnel

Registered: 08/06/1999
Posts: 7868
Originally Posted By: Dignan
Drive automatically does OCR, so when I search Drive for files, it can search within the content for my results.

I was unaware of that capability. But as expected for me, anything Google is out of the question. As is any cloud only solution, as part of the setup will be utilizing Spotlight in OS X heavily.

Originally Posted By: Dignan
The fastest scanners are probably the Fujitsu ScanSnap.
Originally Posted By: Fujitsu
The Fujitsu ScanSnap document scanner does not use a TWAIN or ISIS driver.

This automatically excludes a scanner from my list. I am in the return period for mine if it doesn't work out. However, TWAIN or ISIS is an absolute must. I'd rather deal with finding different software during a potential OS upgrade than have to deal with a new scanner because the old one is no longer supported. On OS X these days, drivers are far less likely to break in an OS upgrade compared to apps. As long as a vendor has drivers that support OS X 10.7 or greater, they are bound to work for a long time (due to 10.7 forcing 64-bit drivers).



Originally Posted By: tanstaafl.
Looking at what you have said above, I have two suggestions, one specific that you can't use because it is too late, and one general that is probably not how you want to do things.

I appreciate the feedback, and your experience with 176,512 documents helps :) Thankfully I have much less than that, and once the initial scanning is done, speed won't be a big concern. I'm going to work to make sure I keep up with scanning as new paperwork comes in. The scanner you ended up with is a bit outside my price range, and larger than the space I was willing to dedicate to this device. Impressive speed though.

Originally Posted By: tanstaafl.
Instead, leave the scanned pages in their original PDF format, and the time you would otherwise have used fighting with the OCR you can spend organizing the PDF files into directories, subdirectories, and mnemonic filenames. Something like Automotive --> Motorcycle --> 2011 --> Repairs --> 2011-05-24 Jason's Superbike Shop.

As you somewhat predicted, I'm very much against this approach. I lived with this method for a time for my music, and started to with photos. Moving to a structure where I can sort, search, and change how I see my collection has weaned me off being a file janitor. It's proven to be more useful for me in my workflows with photos, music, digital documents, and 16 years of e-mail archives. I want the same approach for my scanned documents. For scanned documents with OCR, I can set up smart folders that would contain all my car documents simply by searching for the right keywords. Those same documents could be in a second smart folder representing the past year, without copying files all over a folder structure.
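
The smart folder idea is essentially a saved query over the OCR text, so one file can show up in several views without being copied. A toy sketch of the concept only: it is not how Spotlight is implemented, the documents are made up, and the substring match is deliberately naive (searching for "car" would also hit "card").

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    name: str
    scanned: date
    text: str  # OCR text extracted from the PDF


def smart_folder(docs, keywords=(), since=None):
    """Return names of documents matching a saved query; nothing is moved or copied."""
    hits = []
    for doc in docs:
        body = doc.text.lower()
        if any(k.lower() not in body for k in keywords):
            continue  # a required keyword is missing
        if since is not None and doc.scanned < since:
            continue  # scanned before the cutoff date
        hits.append(doc.name)
    return hits


# Hypothetical documents, for illustration only.
docs = [
    Doc("scan-0001.pdf", date(2013, 11, 2), "Invoice: brake pads and oil change for car"),
    Doc("scan-0002.pdf", date(2012, 3, 14), "Car insurance renewal notice"),
    Doc("scan-0003.pdf", date(2013, 12, 20), "Electricity bill"),
]

car = smart_folder(docs, keywords=["car"])
recent = smart_folder(docs, since=date(2013, 2, 19))
print("Car documents:", car)
print("Past year:", recent)
```

Here `scan-0001.pdf` appears in both views while existing only once on disk.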

*edit* I should add my goal with OCR isn't 100% reproduction. Just good enough to allow searching for documents, and within them. The image saved version would be used in any need for reprinting, or the occasional display on a tablet (likely rare).



Originally Posted By: K447
The Fujitsu bundled OCR software (ABBYY FineReader) does what you require, retaining the original scanned image with the OCR'ed text hidden 'behind' the image. If I mouse select a word or paragraph the highlighting of the OCR text appears right on/under the original image 'text'. Sometimes the alignment of the image text and the OCR text is not perfect but overall it is usually quite good.

This sounds exactly like what I need. Doug also spoke positively about ABBYY software on Windows. However, one thing sticks out to me as a negative, from a review of the Mac version:
Originally Posted By: App Store person
There is no question the OCR engine and conversion utility are top notch. But this product is purposefully hobbled by the utter lack of AppleScript or Automator hooks to be useful in automated workflow where the addition of a scanned document to a MacOSX Folder can trigger FineReader OCR Pro to run a conversion. Instead, ALL document OCR scans from pre-existing sources require either a drag and drop on FineReader OCR Pro, or you have to use “File->Open” dialogs.

The lack of AppleScript/Automator support may be an issue. I was planning on possibly building a workflow to help automate the initial scanning backlog I have.
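
For an OCR tool that does offer a command line, the backlog workflow could be as simple as a script that finds scans not yet processed and hands each one over. A minimal sketch of the "what is still pending" step only: the folder layout is hypothetical, and the OCR call itself is left out because it depends on the tool chosen.

```python
import tempfile
from pathlib import Path


def pending_scans(inbox: Path, archive: Path) -> list[Path]:
    """PDFs in `inbox` that have no same-named file in `archive` yet."""
    done = {p.name for p in archive.glob("*.pdf")}
    return sorted(p for p in inbox.glob("*.pdf") if p.name not in done)


# Demonstration with throwaway folders and empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    inbox, archive = Path(tmp, "inbox"), Path(tmp, "archive")
    inbox.mkdir()
    archive.mkdir()
    for name in ("a.pdf", "b.pdf"):
        (inbox / name).touch()
    (archive / "a.pdf").touch()  # pretend a.pdf was already processed
    todo = [p.name for p in pending_scans(inbox, archive)]

print(todo)
```

Running such a script on a schedule, or from a folder action, would give the trigger-on-new-file behaviour the review says is missing.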


Edited by drakino (19/02/2014 18:38)

#361058 - 19/02/2014 18:59 Re: Scanning, OCR, PDF and OS X, best ways? [Re: drakino]
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5916
Loc: Wivenhoe, Essex, UK
ScanSnap and Evernote, preferably the Evernote special edition scanner. It just works, makes the process so painless.

Insert document, press button. Seconds later scanned and OCR'd PDF is inserted into a new Evernote document.
_________________________
Remind me to change my signature to something more interesting someday

#361061 - 19/02/2014 22:03 Re: Scanning, OCR, PDF and OS X, best ways? [Re: drakino]
K447
old hand

Registered: 29/05/2002
Posts: 798
Loc: near Toronto, Ontario, Canada
If it would be helpful I can point you at some online PDF documents I have previously scanned. Mostly instruction manuals with many drawings and some B&W halftone photos.

Or send me a handful of sample documents and I will send back what ABBYY produces from them :)

The majority of the material I have pushed through my Fujitsu ScanSnap (two different models, currently using an S1500) has been instruction manuals. Many were in the 100-200 page range, double sided.

Very few misfeeds, overall. I have cleaned the feed rollers and rubber friction guides only a few times altogether. Paper dust, mostly.

Feed rate is a few seconds per page (depends on dpi setting, mostly) but the sheet feeder can take a short stack of pages at a time so I just added another bunch every minute or so as the scanner was chewing away on the stack from the bottom. 200+ pages in one go, no problem.

Then the OCR chugs away while I do other things (on or off the computer). I don't recall how long it took for the larger documents, but I recall thinking it was fast enough. Usually by the time I had packed away the original paper document and prepared the next for scanning, the OCR was nearly complete.

For documents of just a page or three, feed times are almost as fast as I can move my hands from dropping them in the input hopper to picking up the emitted pages.

All this is just using a MacBook Air.


Edited by K447 (19/02/2014 23:20)

#361067 - 20/02/2014 16:29 Re: Scanning, OCR, PDF and OS X, best ways? [Re: andy]
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5916
Loc: Wivenhoe, Essex, UK
The requirement for Spotlight doesn't need to rule out cloud solutions. Evernote's notes are included in the Spotlight index.
_________________________
Remind me to change my signature to something more interesting someday

#361069 - 20/02/2014 18:18 Re: Scanning, OCR, PDF and OS X, best ways? [Re: K447]
drakino
carpal tunnel

Registered: 08/06/1999
Posts: 7868
Originally Posted By: K447
If it would be helpful I can point you at some online PDF documents I have previously scanned.

A sample would be appreciated. Curious to see how the final PDF looks, to be able to compare to what I can generate with my current tools.

#361070 - 20/02/2014 18:25 Re: Scanning, OCR, PDF and OS X, best ways? [Re: andy]
drakino
carpal tunnel

Registered: 08/06/1999
Posts: 7868
Originally Posted By: andy
The requirement for Spotlight doesn't need to rule out cloud solutions. Evernote's notes are included in the Spotlight index.

Does Evernote embed the OCR directly into the PDF? And from what I can tell, the processing and storage is all local, with Evernote offering a syncing service in the cloud. This may be more what I'd be willing to deal with, instead of Drive forcing the OCR part as part of an upload to the cloud. Just debating if the ScanSnap + Evernote path is worth paying double what I just did for the scanner. Especially one without TWAIN support, so I'm locked into this one path.

Essentially, I don't want a monthly or yearly fee to store/sync my documents. Accessibility wise, they will live on my own storage, and I can then connect back to my home network if I need to reference one while traveling.

#361074 - 20/02/2014 20:59 Re: Scanning, OCR, PDF and OS X, best ways? [Re: drakino]
K447
old hand

Registered: 29/05/2002
Posts: 798
Loc: near Toronto, Ontario, Canada
Originally Posted By: drakino
Originally Posted By: K447
If it would be helpful I can point you at some online PDF documents I have previously scanned.

A sample would be appreciated. Curious to see how the final PDF looks, to be able to compare to what I can generate with my current tools.
PM sent, with links

#361075 - 20/02/2014 21:05 Re: Scanning, OCR, PDF and OS X, best ways? [Re: drakino]
K447
old hand

Registered: 29/05/2002
Posts: 798
Loc: near Toronto, Ontario, Canada
If I was shopping today, I would TRY to find a scanner that did everything else AND was networkable. This would allow multiple computers to use the scanner, compared to USB which means moving the scanner or cable around or trying to software 'share' the scanner.

I have found it handy to be able to scan directly into my iPad or even iPhone using the VueScan app which works nicely with my networked Canon MP990 all-in-one.

A useful trick with VueScan is to 'scan' a page using the iPhone or iPad camera. Works a little better than dealing with regular photos of pages.

#361077 - 20/02/2014 21:36 Re: Scanning, OCR, PDF and OS X, best ways? [Re: K447]
tfabris
carpal tunnel

Registered: 20/12/1999
Posts: 31597
Loc: Seattle, WA
Originally Posted By: K447
I have found it handy to be able to scan directly into my iPad or even iPhone using the VueScan app which works nicely with my networked Canon MP990 all-in-one.


Nifty. Didn't know anyone had made something like that.

Quote:
A useful trick with VueScan is to 'scan' a page using the iPhone or iPad camera. Works a little better than dealing with regular photos of pages.


I've been enjoying having the Genius Scan app on my iPhone. The results aren't as nice as an actual scanner would be, and sometimes it takes more than one try to get a good readable scan of a document. But in cases where I need to quickly email someone a paper document as a PDF file, and I'm not near a PC and a scanner, it's invaluable. Example: Today my property management company needed me to sign a document. They emailed me the document. I am near a printer here at work, but the networked scanner is quite a ways away and I'm not sure how to connect to it anyway. So I printed the document on the nearby printer, filled it out, and signed it, then used Genius Scan on my phone to make a PDF of the signed document and email it to the property manager, all in one step.
_________________________
Tony Fabris

#361078 - 20/02/2014 21:45 Re: Scanning, OCR, PDF and OS X, best ways? [Re: K447]
drakino
carpal tunnel

Registered: 08/06/1999
Posts: 7868
Originally Posted By: K447
If I was shopping today, I would TRY to find a scanner that did everything else AND was networkable. This would allow multiple computers to use the scanner, compared to USB which means moving the scanner or cable around or trying to software 'share' the scanner.

Agreed, and it's one reason the Brother unit popped to the top of my list. It offers USB in and out (for computer connectivity, and scanning right to a thumb drive), along with WiFi networking. Their iOS companion app is free, but doesn't offer scanning by camera.

#361084 - 20/02/2014 22:21 Re: Scanning, OCR, PDF and OS X, best ways? [Re: K447]
tanstaafl.
carpal tunnel

Registered: 08/07/1999
Posts: 5549
Loc: Ajijic, Mexico
Originally Posted By: K447
If I was shopping today, I would TRY to find a scanner that did everything else AND was networkable.
A possible downside to the networking idea depends on the relative location of the computers and the scanner. If the scanner is on the first floor and your office is on the fourth floor... to use a scanner you have to have direct physical access to the scanner in order to insert your original document.

Just saying...

tanstaafl.
_________________________
"There Ain't No Such Thing As A Free Lunch"

#361096 - 23/02/2014 18:00 Re: Scanning, OCR, PDF and OS X, best ways? [Re: tanstaafl.]
tfabris
carpal tunnel

Registered: 20/12/1999
Posts: 31597
Loc: Seattle, WA
Much like having a networked printer, the advantage of a networked scanner isn't to work with the device remotely from a distance, but rather, the ease of sharing the device between multiple computers/handhelds, without having to plug it into a networked computer that gets left on all the time just to serve it up. Tom's use case is having his MacBook and his iPhone both being able to use the scanner to obtain a document.
_________________________
Tony Fabris
