Digital Cameras vs Scanners for OCR?
ttennebkram asks: "With 6 and 8 Megapixel cameras on the market, some now with Wifi built in, it might be more convenient to shoot pictures of your bills and papers with a camera than fussing with the scanner. By the numbers, it would seem feasible. 300dpi for an 8.5"x11" sheet of paper works out to about 8 megapixels; 300 dpi is usually what OCR vendors suggest. I imagine for high volume good results you'd want to maybe mount the camera on a tripod arm over your desk. Heck, I was thinking of a glass desk and maybe one camera below and one above, and maybe a foot pedal to trigger the cameras (and I suppose a flash and high F-stop would help as well). If I could quickly 'snap' all the junk paper I have and electronically file it, maybe OCR the images at night in batch while I'm asleep, and then maybe get rid of all that paper once and for all. Using a traditional cheap scanner just takes too long. So has anybody tried this? I realize that camera optics are different than scanner optics, so maybe it's not just a question of raw pixel counts. Any thoughts?"
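The megapixel estimate in the question can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the function name is made up for illustration, and it assumes the page fills the frame exactly, which a real camera sensor with a different aspect ratio would not allow.

```python
def pixels_for_page(width_in, height_in, dpi):
    """Return (width_px, height_px, total_px) for a page scanned at the given dpi."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px


# US Letter at the 300 dpi that OCR vendors usually suggest.
w, h, total = pixels_for_page(8.5, 11, 300)
print(f"{w} x {h} px, about {total / 1e6:.1f} megapixels")
# prints "2550 x 3300 px, about 8.4 megapixels"
```

So the "about 8 megapixels" figure holds, but only if none of the frame is wasted on margins or the desk around the page.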
Aspect Ratio and Even Lighting (Score:5, Insightful)
Re:Aspect Ratio and Even Lighting (Score:5, Informative)
I have my own business. I keep all my bills, receipts of deductible expenses, home records, and so on. I keep personal records 7 years except in special cases. I just take the bill when I get it (and most bills are e-mail now!) and put it in the envelope for that biller for that year. At the end of the year I spend less than 30 minutes writing up labels for the next year and, when I get time, I burn the stuff that is more than 7 years old. For "all those bills" I've never needed more than 4 filing drawers, which can be stacked as one cabinet that doesn't take up much space, or I use the two cabinets (2 drawers each) as legs for one of my desks.
I thought about keeping things electronically, but then I realized I'd have to take time to scan them and file them, and that would take a lot more time, overall, than just dropping them in folders. If you want, you can spend all that time scanning. I prefer not to, but then again, I have a life and would rather be cycling or rock climbing than scanning bills.
This, to me, sounds like a geek gone wild, overthinking the solution and trying to come up with a hi-tech answer to a low-tech problem that really doesn't need an answer if one uses a little common sense and simple organization.
Re: (Score:2, Interesting)
Now imagine that those receipts are notes and handouts from your college class, and that you'll want to search them later. Does it still sound like a "geek gone wild?"
Re:Aspect Ratio and Even Lighting (Score:5, Insightful)
I've also found that there is a lot more of value to learn from practical experience than from pedants.
Unless one is a geek gone wild.
Re: (Score:2, Interesting)
Re: (Score:2)
Re:Aspect Ratio and Even Lighting (Score:5, Interesting)
That's what I thought until I actually tried it.
I have a Fujitsu ScanSnap document scanner which I use on all my documents. It scans both sides of a page at the same time, can hold 15 pages (I think) in its feeder tray, and takes 5 or 6 seconds to scan a sheet. Since it scans both sides at once, this actually ends up being 5 or 6 seconds per two pages.
It is small enough to sit on my desk and its "on" switch is the loading tray flap - flap closed is "off".
When I want to scan something, I open the flap, load the tray with the document, and hit the "scan" button.
It quickly scans all the pages and sends the scan to a program called Readiris Pro (v11) - this program will OCR the document and save it into my digital cabinet as a PDF "Image + Text". This is a really cool format because there are actually two "layers" to each page - the actual scan of the page (so it looks right) and then a text layer below that has all the OCR information. What this means is that, although you are looking at a raster image, you can search the PDF for specific information and copy and paste text right out of the document.
Let me clarify that with an example:
Let's say you have a PDF of a utility bill. The PDF you are looking at is a scan of the bill itself - not a text-based representation. However, you can grab the "text" cursor and copy your account number right from the image! Obviously, you are not copying from the image, but from the text layer that has all the OCR'd text positioned correctly on the page, but hidden from view.
Since all the text has been OCR'd, the PDFs are now searchable. Since my digital cabinet is just a collection of folders based on category (Utility, Financial, etc), I use another program (DEVONthink Personal) to index it. Let's say I am talking with my insurance company and they have a question about a claim. I can type in the claim number into DEVONthink and, boom, all the documents which reference that claim will be displayed. Simply clicking on an entry in the result list will bring up the document itself and highlight where the claim number appears on the page. BTW, if a provider allows PDF downloads of actual bills, I can drop them directly into the digital cabinet and they will be indexed along with my other documents.
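The claim-number lookup described above boils down to an index from words to documents. The following is a minimal sketch of that idea only; it is not how DEVONthink works internally, and the file names, sample text, and claim number are invented for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase token to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every token in the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical OCR text layers, keyed by file name.
cabinet = {
    "Insurance/claim-letter.pdf": "Re: claim CLM-48213, payment approved",
    "Insurance/statement.pdf": "Statement for claim CLM-48213 and CLM-50001",
    "Utility/electric.pdf": "Account 7781 amount due 42.10",
}
index = build_index(cabinet)
print(sorted(search(index, "CLM-48213")))
```

Because the text layer comes from OCR, a misread character in the claim number would make a document invisible to an exact-match search like this one.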
Yes - this cost a little much to set up ($300 for the scanner (on sale), $90 total for DEVONthink and Readiris Pro), but I was able to sell the full copy of Adobe Acrobat that came with the scanner on eBay for $175, so the actual cost was closer to $225.
It's probably not for everybody, but I am certainly happy with the process.
- Tony
No scanning required (Score:3, Informative)
Re: (Score:3, Interesting)
That's assuming you need a pristine, perfect photo of the item to be OCR'd. I suspect this is not the case: chances are that as long as you are trying to digitize printed (not handwritten) documents, the OCR won't mind a little fisheye distortion and offish lighting (as long as you make sure there is enough contrast and no dark shadows). It really depends on the flex
Re: (Score:2)
I use it to take pictures of whiteboards, documents, anything rectangular, from the (for me) most convenient angle. The camera detects the rectangle and warps it so that it looks like a perfectly angled top (or front) shot. This works very well for a lot of stuff. I've never tried it for OCR purposes, but from looking at some of the document photos I have, there don't appear to be any obvious problems.
Re: (Score:2)
Re: (Score:2)
When you know nothing about a subject, the first thing to do is close your mouth and learn.
With any decent lens it's very easy to show anything dead on at close range (not close ra
None of that matters for his purpose. (Score:2)
It's almost impossible to shoot a bill or a check stub dead on, at close range, without fish-eyeing, and without getting in your own shadow.
If his purpose is simply not to file paper, a scanner is not required. A $200 Canon from Walmart is all you need if you don't worry about OCR. You move the camera back and use the zoom and it works. I take 1600x1200 pictures of my class notes and the results are perfectly legible. A good desk light saves your batteries by eliminating the need to flash. You move
I sort of tried this (Score:1)
Focus is easy... manually set it. (Score:2)
I admit that the quality isn't quite as good as the scanner, but it's a heck of a lot more convenient and it's good enough for many uses.
FWIW, macro mode doesn't really work well on my old digicam, plus, I'm not sure it
Re: (Score:3, Interesting)
Using a direct flash isn't exactly the best option. The ink, even though black, may pick up noticeable and troublesome highlights. Depending on the range, it may even lead to uneven lighting on the paper itself (having part of the paper brighter than the rest).
Ideally, perhaps you'd want to use softboxes or some other method for more diffuse lighting.
Disclaimer: I'm not really familiar with OCR software though, so I don'
Re: (Score:3, Interesting)
Re: (Score:2)
Sheetfeeder (Score:3, Informative)
Or better yet... (Score:3, Funny)
Re:Sheetfeeder (Score:4, Insightful)
Absolutely.
I tried this, myself, a few years ago. I guarantee that, using a camera, you'll get through, maybe, 100 pages. I got a decent scanner (HP something or other) with a sheet feeder. It does about 12ppm and that turned out to be too slow. I got tired of it in a day or two.
I tried a bunch of different solutions, but I finally had to take it all to work. We had a Fujitsu M4097D and an enormous Ricoh Copier/Scanner/Fax machine. Both did 60ppm, both sides (120 images a minute). I actually made some headway with that setup, but I still didn't finish.
As far as OCR is concerned, don't bother. Even today, it's nowhere near accurate enough. In my experience, the best software out there gets an average of one error per page on a really good scan. Trust me: it will take a lot more of your time than you think to fix that. Assuming you're doing mostly black and white text, G4 compression will compress a 300dpi, 8.5x11 image down to about 100k. At that rate, you can store close to 7000 pages on one CD.
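The storage figures can be sanity-checked with simple arithmetic. This sketch assumes a 700 MB disc and takes the poster's roughly 100k per page as exactly 100,000 bytes; the function names are illustrative only.

```python
CD_CAPACITY_BYTES = 700 * 1000 * 1000   # assumed 700 MB disc
G4_PAGE_BYTES = 100 * 1000              # the poster's "about 100k" per page


def raw_bilevel_bytes(width_in, height_in, dpi):
    """Approximate size of an uncompressed 1-bit-per-pixel page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8


def pages_per_disc(capacity_bytes, page_bytes):
    """Whole pages that fit on a disc, ignoring filesystem overhead."""
    return capacity_bytes // page_bytes


raw = raw_bilevel_bytes(8.5, 11, 300)
print(f"uncompressed 1-bit page: about {raw / 1e6:.2f} MB")
print(f"pages per disc at 100k each: {pages_per_disc(CD_CAPACITY_BYTES, G4_PAGE_BYTES)}")
```

Under these assumptions an uncompressed black-and-white page is a little over 1 MB, so 100k per page corresponds to roughly 10:1 compression, and 7000 pages per disc is the upper bound before any filesystem overhead.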
Bulk indexing (Score:1)
Is there any software available that can roughly OCR a docu
Re: (Score:2, Informative)
Lots.
You could be OK with GOCR and Apache Lucene if you do not require zoning (working out blocks of text and columns).
Oh it is. You will need to add "variants" to your searches
Re: (Score:3, Insightful)
Once you've OCR'd, is there any (preferably Free) software that can parse the text against a grammar and word list and hopefully fix some of these errors? Surely "if there's a digit in the middle of a word, it's probably really the letter with the similar shape," "if an unknown word is a chara
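The first rule quoted above (a digit inside a word is probably a look-alike letter) can be sketched as follows. This is an illustration, not an existing tool: the look-alike table and the tiny vocabulary are assumptions, and a real corrector would need a full word list and a table tuned to the OCR engine in use.

```python
import re

# Assumed look-alike pairs; which digits get confused depends on the OCR engine.
DIGIT_TO_LETTER = {"0": "o", "1": "l", "5": "s", "8": "b"}


def fix_word(word, vocabulary):
    """Swap look-alike digits for letters, but only if that yields a known word."""
    if word.lower() in vocabulary or not any(ch.isdigit() for ch in word):
        return word
    if not any(ch.isalpha() for ch in word):
        return word  # a pure number, e.g. an amount or a year, is left alone
    candidate = "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in word)
    return candidate if candidate.lower() in vocabulary else word


def fix_text(text, vocabulary):
    """Apply fix_word to every alphanumeric run in the text."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: fix_word(m.group(0), vocabulary), text)


vocabulary = {"total", "balance", "bill"}  # stand-in for a real word list
print(fix_text("T0tal ba1ance 2006 bi11", vocabulary))
# prints "Total balance 2006 bill"
```

Checking the candidate against the word list before accepting it is what keeps the rule from mangling account numbers and other legitimate mixed strings.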
Re: (Score:1)
Some require more OCR machines than variant-based search (lots more) to do the load we do. This would mean more space, bigger air con, lots more cash for little gain.
Some will not give information out in a way our current system can use, so we would have to rebuild a large chunk of our system, or scrap products and/or workflow procedures.
Re: (Score:1)
Go with a scanner + sheet feeder!
I tried this once (Score:4, Interesting)
If I were a whizz with Photoshop/GIMP/etc, I suppose I could have done some sort of correction to the picture, but...
I've heard how Kinko's have book scanners that will copy and bind a book for you - perhaps they also have a scanning to CD/DVD service? Would that be cheaper for you?
Re: (Score:2)
Getting rid of distortion is tricky, but making the image a proper B&W is easy.
- Image/Adjust/Desaturate
- Image/Adjust/Levels
Since text i
Re: (Score:1)
An image processing tool that would adjust th
Re: (Score:2)
Any possibility of providing an example? I'd be curious to take a stab at it and see if I could offer a tip or two that'd help. The reason I think this would work is that
Re: (Score:2)
Re: (Score:2)
This example does illustrate the distortion problem, though. I don't know if an OCR could actually read this. (I'd be impressed!)
Re: (Score:2)
I'm taking a computer vision class right now, and using some of the techniques I've learned it doesn't look too difficult to create software that would automatically warp the page back to flatness.
I would think the most difficult part of OCRing this would be all that underlining/circling that somebody did (what the hell kind of idiot ruins a book like that?!).
Re: (Score:2)
It took some experimentation at first, but now the process is quite easy. In GIMP just load the digital print, go to Tools->Color Tools->Contrast, increase the contrast, then to Tools->Color Tools->Threshold, and choose a black/white separation that is a good compromise between
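The contrast-then-threshold steps can be sketched in plain Python on a grid of 8-bit grayscale values. This is not what GIMP does internally; the linear stretch, the cutoff of 128, and the sample pixel values are all assumptions chosen for illustration.

```python
def stretch_contrast(pixels, low, high):
    """Linearly map the range [low, high] onto [0, 255], clipping values outside it."""
    span = max(high - low, 1)
    return [
        [min(255, max(0, (p - low) * 255 // span)) for p in row]
        for row in pixels
    ]


def threshold(pixels, cutoff):
    """Turn every pixel pure white (255) or pure black (0) around the cutoff."""
    return [[255 if p >= cutoff else 0 for p in row] for row in pixels]


# Hypothetical photo patch: greyish paper (about 200) with faded ink (about 60 to 70).
page = [
    [200, 190, 60],
    [210, 70, 205],
]
stretched = stretch_contrast(page, low=50, high=220)
print(threshold(stretched, cutoff=128))
# prints [[255, 255, 0], [255, 0, 255]]
```

With uneven lighting a single global cutoff tends to fail on the darker side of the page, which is why the lighting advice elsewhere in this thread matters for OCR.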
Re: (Score:1)
Just last night my daughter asked me to take pictures of her class schedules. Now they're right there in an e-mail and she doesn't have to keep track of a couple of pieces of paper.
I use a camera with good low-light capability (Fuji Finepix F10 - the new F30's even better) without a flash. I adjust the perspective in Paintshop Pro (to make it look like I took the picture straight o
HP ScanJet 4600 (Score:2)
Some people report not being able to get good scans with it, but I've had no problems.
maybe the wrong approach (Score:2)
Re: (Score:2)
Cartoonists and ADF (Score:1)
You're assuming a standard paper format here (Score:2)
Re: (Score:1)
Re: (Score:3, Informative)
Re: (Score:1)
Google is advertising for people skilled with imaging equipment (http://www.google.co.uk/support/jobs/bin/answer.py?answer=3
only if you need to scan books non destructively (Score:2)
Yes, if you are archiving rare books there isn't much choice, but for most applications sheet feeding or a flatbed is fine (yes, a flatbed without sheet feeding is laborious, but I'm not convinced these "planetary scanners" are any less so).
Why bother with OCR? (Score:2, Insightful)
Re: (Score:2)
Re: (Score:2)
Works great
scanners are FOR documents (Score:5, Insightful)
Get a scanner
Re: (Score:3, Insightful)
Re: (Score:3, Insightful)
Re: (Score:2)
You'd be amazed. (Score:2)
Re: (Score:1)
Re: (Score:1)
Re: (Score:2)
Because "I'm using a mouse" won't make it to Slashdot?
Also for the chicks.
reverse engineer this.. (Score:2, Interesting)
Must have reliable files -and- reliable system (Score:1, Insightful)
My suggestion is to not take things so far digital with your process. Paper doesn't take up so much space that you can't
Not as easy as you think.. (Score:5, Informative)
I'm currently digitizing my collection of old tabloid punk magazines from the 1970s. I had to use a digital camera because flatbed scanners that do 11x17 or larger are extremely expensive, they're like $3000 or more. So I did some experiments with my consumer-grade 5 megapixel digital camera. The results were adequate, barely (and I have an art degree in Photography, this stuff is easy for me, YMMV). I've currently suspended my project until I can afford a higher rez digital camera, mostly because 5Mp is barely enough to capture the little 6 point type that is used in large sections of the magazines. But let me tell you more generally what I've learned.
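To put rough numbers on why 5 megapixels is marginal for 6 point type: the sketch below assumes a 2592x1944 sensor (a common 5 MP layout, not stated by the poster) and an 11x17 inch page that fills the frame as well as its shape allows.

```python
def effective_dpi(sensor_w_px, sensor_h_px, page_w_in, page_h_in):
    """Resolution on the page when the whole page must fit inside the frame."""
    long_px, short_px = max(sensor_w_px, sensor_h_px), min(sensor_w_px, sensor_h_px)
    long_in, short_in = max(page_w_in, page_h_in), min(page_w_in, page_h_in)
    # The tighter of the two axes limits the usable resolution.
    return min(long_px / long_in, short_px / short_in)


def type_height_px(point_size, dpi):
    """Height in pixels of the full type body (1 point = 1/72 inch)."""
    return point_size / 72 * dpi


dpi = effective_dpi(2592, 1944, 11, 17)
print(f"effective resolution: about {dpi:.0f} dpi")
print(f"6 pt type at that resolution: about {type_height_px(6, dpi):.0f} px tall")
print(f"6 pt type at 300 dpi: about {type_height_px(6, 300):.0f} px tall")
```

Under these assumptions the page is captured at roughly half the 300 dpi usually recommended for OCR, so each line of 6 point type gets about half the pixels it would on a scanner.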
First off, you'll need a copy stand. This is a fairly standard photo accessory, but a good copy stand is fairly expensive. You need something that is easily adjustable, so you can raise and lower the camera to get the document to fill the frame, without using too much zooming. The copy stand keeps the camera parallel to the target at all distances. It is important to have quick adjustability in height, rather than zooming. You'd be much better off using a "prime lens" rather than a zoom, as zooms tend to have barrel and keystone distortion.
Secondly, you need lights. If you only want to copy written documents (or B&W magazines like me) you can use cheap spotlights. If you want to do color, you need much better lighting, something with a fixed color temperature, or a flash system. Spotlights are really hot, and when I work in my small office, it gets intolerably hot when I spend about an hour photographing. For better, more repeatable results, you'd be better off getting a flash system. BUT...
Here is the sticking point. You need something to keep the documents flat. That means placing them under a sheet of glass. So you are going to get reflections from the lights, and flash is high intensity lighting which makes it even more difficult to control reflections. The usual method is to put polarizing filters over the lights and the lens, to cancel out the reflections. This is a rather complex method, and a LOW END professional copystand with polarized lighting will set you back about $2500.
OK, so what I did is I adapted my old disused photo enlarger. It was a huge monster for 4x5 negatives, I took off the enlarger head, and used a Bogen photo clamp with a ball-head joint attached to the motorized arm that goes up and down. It does a fairly good job as an improvised copy stand, but it is pretty cramped, the baseboard is only designed to make max 20x24 prints. Also it is a HUGE pain in the ass getting the camera leveled with the baseboard, I use a bubble level. Then I attached a cheap set of tungsten photofloods to the wings of the enlarger, so the light hits the baseboard at a 45 degree angle, to reduce glare. Note that it is best to point each light at the far side of the document, so the light paths cross each other. This gives the light a little distance to fan out and eliminate hot spots. I don't put my documents under glass, they're newspaper pages, so I flatten them for several weeks (!!!) under weights, then if there's a little curl, I use weights (like heavy metal rulers) at the edges, or hold the edges down with post-it notes. That eliminates the need for a glass plate to hold them down, and I don't have to deal with reflections. However, it takes a LOT of time and effort to get the documents positioned and flattened correctly, it is not a quick process.
I use a Canon camera, so I use the Canon Camera Remote to my laptop to preview and take the shot. Even with the lights and some fill flash, I can end up with exposures of 1 or 2 seconds, so I can use a narrow f-stop. This shouldn't be necessary for a flat object, which requires no depth of field, but I find that the lens is sharper stopped down. It takes quite a bit of fiddling to get the optimal
Re: (Score:2)
They're closer to 1000-2500 dollars than 3000 dollars. At least for consumer equipment.
Re: (Score:2)
Re: (Score:2)
Specific model: http://www.newegg.com/Product/Product.asp?Item=N8
I don't have any experience with Microtek scanners, but it'd probably get the job done.
$100 8.5x11 scanner, and scan half-pages? (Score:3, Insightful)
so get yourself an A-size scanner and just scan each page in two parts?
Or if there aren't too many grayscales that you'd trash,
just run it all through a photocopier to shrink to 8.5x11 and scan that?
you are making it too hard. (Score:2)
I'm currently digitizing my collection of old tabloid punk magazines from the 1970s.
That's hard to do but it's not what's required. Snapping legible pictures of a phone bill is not hard if all you want to do is get rid of your paper. Getting OCR is harder, but still not as difficult as making museum grade preservation of artwork.
Be sure to post the results on line some time after the copyright expires. In 2070, they will probably read like Elizabethan English but at long last the public domain will b
Re: (Score:3, Insightful)
BTW, I have privately circulated a few of my PDFs amongst some online punk communi
www.scanr.com (Score:2)
Depending on the quality of the camera, they suggest conference room whiteboards; diagrams, notes, flow charts
Their website has samples of what you'd expect, but basically you're capturing the image and sending it to scanR. They do the conversion and send it back as a
Re: (Score:1)
Re: (Score:1)
Parallax? (Score:2)
Re: (Score:1)
To Clarify... (Score:5, Insightful)
1) lift a lid
2) stick a paper in a well-defined corner
3) press a button
for the hassle of:
1) align a camera on a tripod, including angle as well as position
2) align a paper with no guide
3) adjust the lighting so that you get an even tone
4) make sure you didn't accidentally move the camera, the tripod, or bump the desk
5) step on a foot pedal that you jury-rigged to take a picture
OR
5) Push a button on a camera that you can't afford to move even a hair.
6) Use image software to continue adjusting the photo so that the OCR will read it properly
7) Hope you did everything right the first time.
I think I'd pick door number 1.
Re: (Score:3, Insightful)
1. stick the paper in the slot, it feeds, scans and files in "New Docs"
2. drag thumbnail to register entry in gnucash, it optionally (sometime in the distant future) ocrs it and tries to find the total and the vendor, as well as matching the last 4 to one of your cards to verify it's going into the right account, then gives you a chance to correct its mistakes. The scanned image is included in the financial db attached to the register entry.
Unfortuna
Real camera solution (Score:3, Informative)
Think of an overhead projector with the camera where the mirror is for vertical adjustment.
2 Have a guide for the paper, not that hard.
3 Lighting is an important one, but as long as it's even the type of light doesn't really matter if you set your white balance correctly.
4 If it is a rigid setup, this doesn't really matter
5 Use the camera control software on the computer, you don't need to really use a camera.
6 Save the file and run the OCR software.
I use a similar
The solution (Score:1, Informative)
Works for me (Score:2)
A lot of people seem to be criticizing the idea, but there are some uses for it.
I have a multifunction scanner/printer/copy/fax that cost around $100 when it was purchased with a computer a year or two ago. It's great for scanning in receipts for work expense claims, and having soft copies of important paperwork. I used to hand-hold a digital camera, with receipts and papers on a well-lit, flat surface, and photograph in macro mode. Now I'll be going back to that method as I've moved into a very small plac
Re: (Score:3, Interesting)
I agree. I don't know about the OCR thing but I take a picture of everything. Every business card I get, every little receipt or scrap of paper. And why not? Just takes a second and it's done, I always have a digital copy to go back and read or print out if need be.
If I only had a scanner I'd never bother, in fact I had a scanner for years before I had a digital camera capable of doing this and I never bothered then.
Scanner fuss (Score:3, Funny)
Boy, you're right! Who'd want to fuss with a scanner!?
I've tried this... stick with a scanner for now. (Score:2)
I've also used a Casio Exilim camera to photograph pages.
The way that it's done for archival purposes is to have a mount that holds a book and also holds a medium-format camera about four feet away. To get good resolution for OCR you'
Desktop duplex scanners (Score:1)
These units scan both sides of the document in the same pass, at between 4 and 30 monochrome sheets per minute depending on resolutio
Re: (Score:1, Informative)
You could buy one of these... (Score:1)
What I've seen (Score:2)
Back to the sheet feeds, I've worked with the Fujitsu fi-5220C [fujitsu.com]
Not an entirely original idea... (Score:2)
Following World War II, a lot of photography businesses would photograph the discharge and service records of US servicemen and women, so they could present a copy of their records when applying for various benefits offered to veterans. My grandmother did this (the photography) after the war. She was a veteran as well - she worked for US Navy Intelligence working on Japanese military cyphers. She also analysed aerial photographs of enemy positions, but I digress.
Unfortunately, my grandmother passed away, so
Manga scanning setup with digital camera (Score:2)
http://www.mrdummy.net/mangatranslation/tutorial0
Re: (Score:1)
This setup looks very good. The emphasis on reflection control is very important, especially copying text and images on coated stock.
My first improvement would be daylight-balanced fluorescents (probably circular) rather than straight &/or a bunch of cheap 3M monitor anti-glare screens to polarize the lights. My current preferred lighting is Colortran halogen soft-boxes and a ring light around the lens, but that's specialized and expensive.
I discovered when scanning two-sided pag
Too big (Score:2)
It'll be too big. I have a standalone USB-powered scanner that's 1" thick. (It's now retired because my printer came with an integrated scanner.) It was great when I was in college, because I could stick it in my backpack and take it to the library.
It sounds like you want to perform scanning in a batch job. Perhaps an off-the-shelf solution is better, even if it's slow? (You'll be asleep, at work, etc.) Do the old HP ScanJets allow for batch scanning?