
Comment this is going to get worse (Score 2) 431

The War on Drugs, the War on Terrorism, and ridiculous safety concerns have pretty much killed home chemistry. As a hobbyist or student, buying a chemical beaker or Erlenmeyer flask can get you into legal trouble in some places, and it will probably get you onto watch lists. Chemical kits and sets have been dumbed down so that they contain next to nothing of interest, and even their containers for flour (which you provide yourself from your kitchen) carry health warnings. The War on Guns may well kill 3D printing, CNC machining, and metal working if we don't watch out. Software development is threatened by governmental desires to have backdoors into major software systems. Model airplanes and drones are subject to increasing and mostly unnecessary regulations. These developments threaten to turn a nation that has thrived on innovation and technology into a sclerotic empire dominated by bureaucrats and courtiers, like so many before us in history.

This isn't a partisan problem; it's a problem with politics and journalism being dominated by people in both parties who know little about science but who gain power by spreading FUD. Remember that to politicians, people who value science are just another demographic and voting bloc, and that politicians will tell you what you want to hear in order to get elected. If you want technology and innovation to thrive, think about this next time you vote.

Comment whatever makes you happy (Score 2) 29

Docker is just a way of starting processes on top of a union file system, with some simple capabilities management. You can wrap whatever other security features you want around it. Frankly, SELinux wouldn't be my first choice, both because of where it comes from and because I don't like the way it works, but, hey, whatever floats your boat.

As far as SELinux and AppArmor are concerned, what I'd really like to see is being able to install Ubuntu without either package installed. Right now, I seem to be pretty much forced to install both, whether I want to or not.

Comment lots of tough problems in OCR (Score 2, Interesting) 76

OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.

Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.

And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error and looks quite bad; people are much more sensitive to OCR errors than pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you only get very little CPU time per character.
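
To put the 1% figure in perspective, here is a rough back-of-the-envelope sketch. The page size of 2,000 characters and the assumption that errors are independent are mine, purely for illustration; real pages and real error patterns vary.

```python
def expected_errors(chars_per_page: int, error_rate: float) -> float:
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * error_rate


def page_error_free_probability(chars_per_page: int, error_rate: float) -> float:
    """Chance that a whole page comes out clean, assuming independent errors."""
    return (1.0 - error_rate) ** chars_per_page


if __name__ == "__main__":
    chars = 2000   # assumed characters per page (hypothetical)
    rate = 0.01    # the 1% character error rate mentioned above
    print("expected errors per page:", expected_errors(chars, rate))
    print("probability of a clean page:", page_error_free_probability(chars, rate))
```

Under those assumptions you get about 20 wrong characters on every page and essentially no clean pages, which is why a 1% character error rate looks bad to a human reader.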

We've been developing an OCR system for a while now. It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.

Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.

Comment difference is resolution, actually (Score 1) 138

The article mentions Google's similar dewarping system; the difference here is speed.

There is nothing preventing Google from pushing high speed video through their book software. In fact, they could probably do that with very little work, since you can use an off-the-shelf high speed video recorder and then just push the frames through the regular processing pipeline.

The reason they don't (and nobody else does) is that it's not useful. For acceptable quality in book scanning, you need upwards of 10 Mpixels. Even if you had a 10 Mpixel high speed camera, you would still need some control over lighting and camera/book angles for decent results.
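
A rough sketch of the arithmetic behind the resolution argument. The page size (8.5 x 11 inches), the 300 dpi target, and the 1920 x 1080 video frame are illustrative assumptions of mine, not figures from the article or from any particular scanning setup.

```python
def megapixels(width_in: float, height_in: float, dpi: float) -> float:
    """Pixels (in millions) needed to cover a page at the given resolution."""
    return (width_in * dpi) * (height_in * dpi) / 1e6


def effective_dpi(sensor_w_px: int, sensor_h_px: int,
                  width_in: float, height_in: float) -> float:
    """Best resolution a sensor can deliver when the page fills the frame.

    The longer sensor side is aligned with the longer page side; the
    tighter of the two directions limits the result.
    """
    long_px, short_px = max(sensor_w_px, sensor_h_px), min(sensor_w_px, sensor_h_px)
    long_in, short_in = max(width_in, height_in), min(width_in, height_in)
    return min(long_px / long_in, short_px / short_in)


if __name__ == "__main__":
    # Assumed single page of 8.5 x 11 inches scanned at 300 dpi.
    print("Mpixels needed:", megapixels(8.5, 11, 300))
    # Assumed 1920 x 1080 video frame covering the same page.
    print("dpi from a video frame:", effective_dpi(1920, 1080, 8.5, 11))
```

With these numbers a single page already needs about 8.4 Mpixels, and a two-page spread needs twice that, while an HD video frame gives well under half the target resolution.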

Comment digital camera capture and software (Score 1) 222

Because you're talking about capturing with a digital camera, I assume you don't want to cut them apart. If you are willing to cut the spine, then a sheet-fed scanner is the easiest and cheapest solution. You can re-bind afterwards. Or you can get a double wide scanner and probably still scan flat.

If you do want to use a camera, it's important to set up the camera and lighting correctly to make sure you start with good quality images. Watch out for specular highlights.

What you do afterwards depends on the binding. If you can open the magazines flat, then you simply need to normalize the intensity and/or threshold the image. There are some good tools for that in the OCRopus OCR toolkit.
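
Independent of any particular toolkit, here is a minimal sketch of what "normalize and threshold" can mean. This is not OCRopus code. It assumes a grayscale image given as a list of rows of integers in 0..255, uses a simple global contrast stretch, and picks the threshold with Otsu's method; all function names are made up for illustration. Real camera captures with uneven lighting usually need local (per-region) normalization instead.

```python
def normalize(image):
    """Linearly stretch pixel intensities to the full 0..255 range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: nothing to stretch.
        return [[0 for _ in row] for row in image]
    scale = 255.0 / (hi - lo)
    return [[int(round((p - lo) * scale)) for p in row] for row in image]


def otsu_threshold(image):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]           # pixels with value <= t (dark class)
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0         # pixels with value > t (light class)
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(image):
    """Normalize, then mark dark pixels (ink) as 1 and background as 0."""
    stretched = normalize(image)
    t = otsu_threshold(stretched)
    return [[1 if p <= t else 0 for p in row] for row in stretched]


if __name__ == "__main__":
    # Tiny made-up "scan": low contrast, dark text on a lighter page.
    page = [[100, 110, 180],
            [105, 175, 185]]
    for row in binarize(page):
        print(row)
```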

If you can't flatten the magazine before capture, you need dewarping software. Dewarping can be pretty tricky. My group has developed some software for it and we're thinking about starting an open source project around it.

There are some on-line dewarping demos here:

And there are some papers on it here:
