tmbdev - Slashdot User

Comment whatever makes you happy (Score 2) 29

by tmbdev on Wednesday September 03, 2014 @12:37PM (#47817675) Attached to: Bringing New Security Features To Docker

Docker is just a way of starting processes on top of a union file system, with some simple capabilities management. You can wrap whatever other security features you want around it. Frankly, SELinux wouldn't be my first choice, both because of where it comes from and because I don't like the way it works, but, hey, whatever floats your boat.

As far as SELinux and AppArmor are concerned, what I'd really like to see is being able to install Ubuntu without either package installed. Right now, I seem to be pretty much forced to install both, whether I want to or not.

Comment lots of tough problems in OCR (Score 2, Interesting) 76

by tmbdev on Tuesday June 22, 2010 @09:55PM (#32660852) Attached to: Google Adds OCR To PDF and Images

OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.

Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.

And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error rate and looks quite bad; people are much more sensitive to OCR errors than to pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you get very little CPU time per character.
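To see why 1% looks bad, here is a rough back-of-the-envelope sketch. The page length (2000 characters), the word length (5 characters), and the assumption that errors are independent are illustrative choices, not measurements from any OCR system.

```python
def expected_errors(chars_per_page: int, error_rate: float) -> float:
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * error_rate


def prob_word_correct(word_length: int, error_rate: float) -> float:
    """Chance that every character of a word is right,
    assuming (hypothetically) independent per-character errors."""
    return (1.0 - error_rate) ** word_length


if __name__ == "__main__":
    # Assumed values for illustration only.
    print(f"{expected_errors(2000, 0.01):.0f} errors per page")   # 20 errors per page
    print(f"{prob_word_correct(5, 0.01):.3f} of words correct")   # 0.951 of words correct
```

Under these assumptions, about 20 characters per page are wrong and roughly one word in twenty contains a mistake, which a reader notices immediately.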

We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.

Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.

Comment Re:Anyone know what they're using for the OCR? (Score 1) 76

by tmbdev on Tuesday June 22, 2010 @09:38PM (#32660732) Attached to: Google Adds OCR To PDF and Images

FWIW, I believe a lot of OCRopus hasn't been incorporated at Google yet because OCRopus itself is still under heavy development.

Comment difference is resolution, actually (Score 1) 138

by tmbdev on Wednesday March 17, 2010 @04:18PM (#31514502) Attached to: Japanese Researchers Develop World's Fastest Book Scanner

The article mentions Google's similar dewarping system; the difference here is speed.

There is nothing preventing Google from pushing high speed video through their book software. In fact, they could probably do that with very little work, since you can use an off-the-shelf high speed video recorder and then just push the frames through the regular processing pipeline.

The reason they don't (and nobody else does) is that it's not useful. To get acceptable quality from book scanning, you need upwards of 10 Mpixels. Even if you had a 10 Mpixel high speed camera, you would still need some control over lighting and camera/book angles for decent results.
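To put the 10 Mpixel figure in perspective, here is a rough sketch of the arithmetic. The page size (US letter, 8.5 x 11 inches), the 300 dpi target, and the 1 Mpixel high speed video frame are illustrative assumptions, not figures from the article or from Google's pipeline.

```python
import math


def megapixels(width_in: float, height_in: float, dpi: float) -> float:
    """Pixels (in millions) needed to cover a page at the given resolution."""
    return (width_in * dpi) * (height_in * dpi) / 1e6


def effective_dpi(width_in: float, height_in: float, mpixels: float) -> float:
    """Resolution you get when a sensor of `mpixels` covers the whole page
    (assumes the sensor aspect ratio matches the page)."""
    return math.sqrt(mpixels * 1e6 / (width_in * height_in))


if __name__ == "__main__":
    # Assumed: letter-size page scanned at 300 dpi.
    print(f"{megapixels(8.5, 11, 300):.1f} Mpixels")        # 8.4 Mpixels
    # Assumed: a 1 Mpixel high speed video frame over the same page.
    print(f"{effective_dpi(8.5, 11, 1.0):.0f} dpi")          # 103 dpi
```

So a single page at 300 dpi already needs about 8.4 Mpixels, while a 1 Mpixel frame gives only about a third of that linear resolution.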

Comment digital camera capture and software (Score 1) 222

by tmbdev on Saturday July 05, 2008 @10:10PM (#24071715) Attached to: Ask Slashdot: Digitizing Old Magazines

Because you're talking about capturing with a digital camera, I assume you don't want to cut them apart. If you are willing to cut the spine, then a sheet-fed scanner is the easiest and cheapest solution. You can re-bind afterwards. Or you can get a double-wide scanner and probably still scan the magazines flat.

If you do want to use a camera, it's important to set up the camera and lighting correctly to make sure you start with good quality images. Watch out for specular highlights.

What you do afterwards depends on the binding. If you can open the magazines flat, then you simply need to intensity normalize and/or threshold the image. There are some good tools for that in the OCRopus OCR toolkit (www.ocropus.org); have a look here:

http://sites.google.com/site/ocropus/documentation
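For a sense of what "intensity normalize and/or threshold" means, here is a minimal sketch in plain Python. It is not the OCRopus API; it uses a simple min/max stretch followed by Otsu's method as a stand-in technique, and the function names are hypothetical. The image is assumed to be a non-empty list of rows of grayscale values.

```python
def normalize_intensity(gray):
    """Rescale pixel values so the darkest maps to 0.0 and the brightest to 1.0."""
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in gray]
    scale = float(hi - lo)
    return [[(p - lo) / scale for p in row] for row in gray]


def otsu_threshold(norm, bins=256):
    """Return the cut in [0, 1] that maximizes between-class variance (Otsu's method)."""
    hist = [0] * bins
    for row in norm:
        for p in row:
            hist[min(int(p * bins), bins - 1)] += 1
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    count_dark, sum_dark = 0, 0
    for i, h in enumerate(hist):
        count_dark += h
        sum_dark += i * h
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue  # a split with an empty class is not a valid candidate
        mean_dark = sum_dark / count_dark
        mean_light = (total_sum - sum_dark) / count_light
        var = count_dark * count_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_bin, best_var = i, var
    return (best_bin + 1) / bins


def binarize(gray):
    """True marks background (paper), False marks foreground (ink)."""
    norm = normalize_intensity(gray)
    t = otsu_threshold(norm)
    return [[p >= t for p in row] for row in norm]


if __name__ == "__main__":
    page = [[50, 50, 200], [200, 50, 200]]  # toy 2x3 "scan": dark ink, bright paper
    print(binarize(page))
```

A single global threshold like this only works when the lighting is even across the page, which is one more reason to get the camera and lighting setup right first.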

If you can't flatten the magazine before capture, you need dewarping software. Dewarping can be pretty tricky. My group has developed some software for it and we're thinking about starting an open source project around it.

There are some on-line dewarping demos here:

http://www.iupr.org/demos

And there are some papers on it here:

http://pubs.iupr.org/

http://www.m.cs.osakafu-u.ac.jp/cbdar2007/

http://www.m.cs.osakafu-u.ac.jp/cbdar2005/
