Comment: Re:Exactly (Score 3, Informative)

by locoluis (#27173429)

Oh, here's a catch.

Some PDF creators map each font's characters to internal codes in order of first appearance, not in Unicode order. This means that things like pdftohtml, screen readers, or even plain copy/paste no longer work: they yield gibberish instead.

For example, the string:

"This is a PDF test."

Would get stored as something like:

And pdftohtml yields something like:

Oh, and each typeface gets a distinct ordering, so the same string in different typefaces would probably get encoded differently...
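
To make the idea concrete, here is a toy model in Python of "codes assigned in order of appearance". It is only a sketch: the class name SubsetFont, the starting code of 1, and the sample text "Hello" are my own assumptions for illustration, not what any particular PDF creator actually does.

```python
class SubsetFont:
    """Toy model of a font whose character codes are handed out
    in order of first appearance (hypothetical, for illustration)."""

    def __init__(self, first_code=1):
        self.first_code = first_code
        self.table = {}  # character -> internal code

    def encode(self, text):
        codes = []
        for ch in text:
            if ch not in self.table:
                # The next unused code goes to the first unseen character.
                self.table[ch] = self.first_code + len(self.table)
            codes.append(self.table[ch])
        return codes


text = "This is a PDF test."

# Typeface A sees the string first, so "T" gets the first code.
font_a = SubsetFont()
codes_a = font_a.encode(text)

# Typeface B has already been used for other text, so its table
# starts out different and the same string gets different codes.
font_b = SubsetFont()
font_b.encode("Hello")
codes_b = font_b.encode(text)

print(codes_a)
print(codes_b)
```

In this model "T" ends up as code 1 in the first typeface and code 5 in the second, so an extractor that treats the codes as Unicode (or even as one shared table) gets nonsense.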

To decode this you have to both recognize the actual glyph shapes AND know which typeface is used in each segment of text, which is a PITA. Otherwise, you're lost.
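
A rough sketch of what that decoding looks like, assuming you have somehow rebuilt a code-to-character table per typeface (say, by looking at the glyph shapes). The font names "F1" and "F2", the tables, and the function decode_segments are all made up for the example.

```python
def decode_segments(segments, glyph_tables, unknown="\ufffd"):
    """Decode (font_name, codes) segments using one table per typeface.

    Codes with no entry in the table become the `unknown` marker.
    """
    out = []
    for font_name, codes in segments:
        table = glyph_tables.get(font_name, {})
        out.append("".join(table.get(code, unknown) for code in codes))
    return "".join(out)


# Hypothetical per-typeface tables, each in its own order of appearance.
glyph_tables = {
    "F1": {1: "T", 2: "h", 3: "i", 4: "s", 5: " "},
    "F2": {1: "a", 2: " ", 3: "t", 4: "e", 5: "s", 6: "."},
}

# The text is split into segments, each tagged with its typeface.
segments = [
    ("F1", [1, 2, 3, 4, 5, 3, 4, 5]),
    ("F2", [1, 2, 3, 4, 5, 3, 6]),
]

decoded = decode_segments(segments, glyph_tables)

# Ignoring the typeface and using F1's table everywhere gives garbage.
wrong = decode_segments([("F1", codes) for _, codes in segments], glyph_tables)

print(decoded)
print(wrong)
```

The point is that the table lookup is keyed on the typeface first: the same code means different characters in different segments.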

OCR may or may not be of any help, depending on the typeface used...


