From Paper To PDF?
"After a week of on/off searching, I did find some good references as well as nearly all the parts necessary for the job, including open source OCR engines, PDF and Postscript tools, search engines, and the like.
Unfortunately, I came up with only two solutions -- neither of them Open Source, and both quite costly (premium beer): Adobe Capture or dedicated "PDF scanners" like this one.
My question to the Slashdot crowd is this:
- Is there a cost-effective way of moving existing dead-tree documents into either HTML, PDF, or other searchable mixed text and graphics format?
We all deal with a mix of electronic and printed documents -- and if you're like me, you've paid for some of them in both formats.
If you're like me, you buy new documents in electronic, searchable, format when you can. How many of us have O'Reilly's Networking Bookshelf, or some other CD texts ready to search on our notebooks and networks?
Yet, I have a four foot wide stack of technical documents and books that just isn't going to come with me on each plane trip. I'm not going to get rid of them -- they are still valuable -- but I can't figure out how to make them useful more often.
The available tools for capturing paper and converting it into searchable PDFs are costly, and are geared toward corporations that can justify the costs by the number of users. To me, a per-use licence of Adobe's Capture --
-- is just not cost effective.
If the document is already a text document -- even if it's in some word processor I don't use -- generating PDF files is easy and cheap;
Print a document to a Postscript file, or create one. For example a simple text document is trivial;
- enscript file.txt -p file.ps
Convert the resulting Postscript file to PDF;
- ps2pdf file.ps file.pdf
Converting a paper document to PDF is also easy. Just scan the image and use tiff2ps or jpeg2ps to create the PostScript file. The only problem is that the resulting PDF is a bitmap image and isn't searchable.
Interestingly enough, TIFF -- a format used extensively for scanned documents -- does support TIFF+Text, but usually only as an extension to TIFF, and it isn't really an optimal format; see The Unofficial TIFF Home Page.
So, if you want to search the documents and keep the formatting and diagrams, you're back to paying Adobe for Capture or some other nearly as expensive method. "
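The text-to-PDF steps in the question can be sketched as a small Python helper. This is only an illustration: the function name `text_to_pdf_commands` is made up here, and it just builds the same two command lines the post shows (enscript, then ps2pdf) rather than running them.

```python
import os
import shlex


def text_to_pdf_commands(text_path):
    """Build the two commands from the post: text -> PostScript -> PDF."""
    stem, _ext = os.path.splitext(text_path)
    ps_path = stem + ".ps"
    pdf_path = stem + ".pdf"
    return [
        ["enscript", text_path, "-p", ps_path],  # plain text to PostScript
        ["ps2pdf", ps_path, pdf_path],           # PostScript to PDF
    ]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run() to execute them.
    for argv in text_to_pdf_commands("file.txt"):
        print(shlex.join(argv))
```

As the post says, this only helps when the source is already text; a scanned bitmap run through the same route stays unsearchable.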
use DjVu instead of PDF (Score:1)
The files are 4 to 8 times smaller than PDF for B&W documents. The DjVu plug-in is 10 times smaller and a whole lot faster than Acrobat Reader. It runs on Linux/Unix, Windoze and Mac.
The DjVu compressor is free for non-commercial uses, and the decoder source code is available.
Expervision [expervision.com]'s OCR software can read DjVu files. They even have an OCR toolkit for Linux.
Although DjVu supports embedded searchable text, Expervision's engine cannot embed text into DjVu files; it can only produce a text file (or a number of other formats). For web-based search, you can use a simple CGI script to return the DjVu files that correspond to the text files that contain a match to the search string.
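The lookup behind such a CGI script might look like the sketch below. It assumes (this is not stated in the comment) that each OCR text file sits in one directory and shares its base name with the DjVu file, e.g. `manual.txt` next to `manual.djvu`; the function name is hypothetical.

```python
from pathlib import Path


def find_djvu_matches(text_dir, query):
    """Return the DjVu file names whose OCR text file contains the query."""
    needle = query.lower()
    matches = []
    for txt in sorted(Path(text_dir).glob("*.txt")):
        # OCR output can contain odd bytes, so decode leniently.
        if needle in txt.read_text(errors="replace").lower():
            matches.append(txt.with_suffix(".djvu").name)
    return matches
```

A CGI wrapper would only need to read the search string from the request and print the returned names as links.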
Is it legal to convert PostScript to PDF? (Score:1)
Unless I miss my guess, Adobe has patented the PDF format and only Adobe Acrobat (and other related products) can legally generate the PDF material. We had an Adobe rep out here once, and we asked him about third-party PDF authoring software, and he told us the same thing: all roads to PDF generation lie through Adobe. It's kind of a raw deal, if you ask me, but at least there are viewers for Linux. Proprietary formats are never a good thing, but they could be far worse (see MS Word!)
Anyway I don't intend to tell you how to do business. If you're happy with the way your setup is working for you and you're not worried about the possible legal implications, then by all means, go ahead. Just remember that ignorance is no excuse in the eyes of the law (a lesson that I've learned the hard way a couple of times!)
--
Dale Sieven
Making PDFs (Score:1)
Re:where can one find ps2pdf ? (Score:1)
Does Anyone know of a Open Source OCR that works? (Score:1)
My solution: (Score:1)
Using SANE, we scanned the documents (all text based), and used enscript to convert them to PS (we didn't use PDF, but you can just filter them with ps2pdf as you already know). There were some problems with a few of the papers, which dealt with the nutritional value of a few hot breakfast foods, but it turned out that the FDA logo at the top was giving the old scanner trouble. It turned out it was set up for text scanning only, so don't forget to check your scanner settings.
All in all, the higher-ups were happy, and so was I.
Re:But wait, there's more - (Score:1)
Re:Postscript more widely used in print houses (Score:1)
How do we use them? Well, we can do a number of things:
1.) Use Adobe Acrobat to export eps's.
2.) Use QuarkXPress 4.x+ and import the PDF as an image file.
The paper I work for also does job printing; we print a number of papers for a few small towns. We get the pages sent to us on a 250MB Zip disk as PDF's. Most of the time we don't even have to convert them. How? Well, our negative maker (or imagesetter) is really a combination of a dedicated "printer" unit and a Power Mac. The Mac handles some of the conversions necessary; our OPI server can actually make 4-color separations of PDFs automatically.
Yes, PostScript is the standard. But, hey, when something doesn't come in as a PostScript file, you convert. Either that, or you don't get paid to do the job, and your competition does so for you.
Re:Adobe Acrobat 4.0 (Score:1)
We used this some time back to scan whole books (including graphics). Just scan the pages in b/w and the pictures in color or line-art and use Acrobat to OCR the pages.
The texts were Dutch and we were amazed by the quality. The texts were about 99% accurate compared with the paper version!
Definitely a winner in our book!
Cosource has projects working on open source ocr (Score:1)
http://www.cosource.com/cgi-bin/cos.pl/wish/info/337 [cosource.com]
[Disclaimer: I work for Cosource.]
Re:source-code only solution (Score:1)
Re:ps2pdf produces small files (Score:1)
Re:Other ways... (Score:1)
"The searchability factor is the only reason OCRing is needed in most instances."
That and the need to keep the total file size down to a manageable level.
Generating PDF files (Score:1)
One of my projects is a Java PDF generator library, that allows anything that can be printed in Java to be sent to a PDF file.
To do this, I used the PDF specification that Adobe publish deep on their web site (Can't remember the link, but a search will find it). The first few pages actually encourage third party developers to write their own generators.
The one thing that they do restrict, is that no one can "Change the PDF format", but that is reasonable - why (unless you are a certain unmentionable corporation) would you want to do that?
Re:Adobe Acrobat 4.0 (Score:1)
--
Re:why bother with PDF? (Score:1)
Ok, how many of us just opened another window, went to Freshmeat, and spent 5 minutes searching for such a utility, hoping to score some Karma for a link to it? :)
--
Never search for "Coke" this way (Score:1)
Hint: don't use netscape? (Score:1)
Or "law" of averages (Score:1)
This would sometimes fail where the source document is mis-spelled. A side-effect might be electronic copies better than the original.
Open Source (and Free as in Beer) solution (Score:1)
Remember to add -ascii7 if MeatheadSystems' Index Server doesn't like Latin1 character sets.
How about searchable images? (Score:1)
Any thoughts from image gurus on the viability of this?
Re:source-code only solution (Score:1)
Monks (Score:1)
For small runs Adobe Acrobat will work just fine (Score:1)
GNU OCR (Score:1)
http://www.socr.org/
Not currently being developed at a noticeable rate
Re:If you have a Mac (Score:1)
But wait, there's more - (Score:1)
Anyway, I think the way to do it would be to do the following:
Acquire the images either by scan or by fax. (Or other docs by email or FTP... Why not make this more comprehensive?)
Store them in a database.
OCR them as best you can with the tools available at the time.
Store this OCR'd text in the same row as the image.
Create a field of keywords derived from the OCR'd text and use this for searches.
Now you have a simple database of everything you need. The original image, (or document, or whatever,) and the 'Best Guess' as to the contents of the image.
If a user wants a PDF, let it be created at runtime - Pages 1,2,3...x are the images. The last page is your searchable index of keywords.
If a better way presents itself later, do that.
If a user wants it in HTML, great. You can even embed the images.
The benefits of using a database are these:
You can always go back and re-OCR the image when better Open-source tools are available.
You can search your whole company's documents, not just one at a time.
You are not limiting your users to using one format.
Don't think of this as a process that has to require a lot of user intervention and only gives you a dead-end format!
With this method, you are not limiting the output.
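The scheme above (image, OCR text and keywords in one row, with re-OCR possible later) can be sketched with SQLite. Everything here is an assumption made for illustration: the table and column names, the crude keyword extraction, and the function names are not from any existing tool.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Create the single table: original image, best-guess text, keywords."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, image BLOB NOT NULL, "
        "ocr_text TEXT, keywords TEXT)"
    )
    return db


def keywords_from(text, min_len=4):
    """Very rough keyword list: unique lowercased words of min_len or more."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return " ".join(sorted(w for w in words if len(w) >= min_len))


def store_page(db, image_bytes, ocr_text):
    cur = db.execute(
        "INSERT INTO pages (image, ocr_text, keywords) VALUES (?, ?, ?)",
        (image_bytes, ocr_text, keywords_from(ocr_text)),
    )
    db.commit()
    return cur.lastrowid


def reocr_page(db, page_id, new_text):
    """Replace the text and keywords when a better OCR engine comes along."""
    db.execute(
        "UPDATE pages SET ocr_text = ?, keywords = ? WHERE id = ?",
        (new_text, keywords_from(new_text), page_id),
    )
    db.commit()


def search(db, word):
    """Return ids of pages whose keyword field contains the whole word."""
    rows = db.execute(
        "SELECT id FROM pages WHERE ' ' || keywords || ' ' LIKE ?",
        ("% " + word.lower() + " %",),
    )
    return [row[0] for row in rows]
```

Because the original image stays in the row, output formats (PDF, HTML) can be generated on request and the text can be regenerated without rescanning.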
Cheers,
Jim In Tokyo
[Off topic] An absolutely usable solution (Score:1)
Go to:
http://www.zope.org/
... and grab Zope:
http://www.zope.org/Products/Zope/2.1.6/
... and the latest version of ZpdfDocument:
http://www.zope.org/Members/gaaros/ZpdfDocument
We use this where I work (IT dept.) in production.
Apart from the fact that it only handles different kinds of text so far, it runs perfectly.
Just click the "report link" and there Acrobat opens. Neat.
Best regards,
Steen Suder
What about Trapeze? (Score:1)
As far as I know, they do OCR as well. All together, it's pretty darn cool. And no, I don't work for them.
Cheers,
Graeme
Re:A possible solution? (Score:1)
Feeding such a program a Postscript file which contains nothing other than an image will not produce the desired output (if any output at all).
It doesn't matter how many times you convert from one overlapping format to another; OCR systems don't just materialize out of the ether, someone has to write them. And so far, those who have done so don't see the need to give them away.
Commercial support (Score:1)
Re:And now for the obligatory... (Score:1)
You're going to give them swords?
Take a look at "Romeo and Juliet"... No, wait, Romeo and his peers were of junior high school age...
Printing Linux HOWTOs (Score:1)
They are created with tools which create documents in several formats. If you want to print the entire document, you should use one of the formats which contains the entire document.
PDF to text, then index. (Score:1)
Or there's the related PDFTOHTML [uni-stuttgart.de] if you prefer that for your access method.
Re:why bother with PDF? (Score:1)
Creating a PDF document from the web app is the best way to make sure the form can be used, since the users may need to have HTML fonts, colors, etc. overridden for their use, but the form must be properly formatted with specified fonts, etc.
DocuLex Alternative to Adobe Acrobat Capture (Score:1)
I haven't had much luck finding anything cheaper. Ideally, I would like something to hook up to our digital copier and convert the scans to
Quick And Dirty (Score:1)
Set up your script to link the OCR page with the original scan. That way, your search engine will most likely be able to get you to the correct page, but if the OCR hoses some important words, you can always just click "original page here" to see what it said. This would allow near immediate functionality of your new database and would allow you to proofread "on the fly" so to speak and correct errors when you find them.
This should be a good solution (even though it is a bit of a hack job) especially if the searcher is familiar with the particular documents and can devise several searches - in case a keyword or two is munged by OCR.
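A minimal sketch of the linking idea: each search hit shows the OCR'd snippet together with a link back to the scanned page. The function name, the HTML layout and the example path are assumptions for illustration only.

```python
from html import escape


def result_entry(page_no, ocr_snippet, scan_url):
    """Format one search hit with a link to the original scanned page."""
    return '<p>Page %d: %s <a href="%s">original page here</a></p>' % (
        page_no,
        escape(ocr_snippet),           # OCR text may contain < or &
        escape(scan_url, quote=True),  # keep the attribute well-formed
    )
```

For example, `result_entry(3, "a < b", "scans/003.tiff")` produces a paragraph whose link points at `scans/003.tiff`.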
HTML first maybe? (Score:1)
Printing from Netscape to PS and then using ps2pdf gives nice and searchable results.
Re:tiff2ps (Score:1)
The guy wants to know how to take an image containing text, and create a pdf containing an image, with that text as real text, not a bitmap.
I.e., some software that will OCR the image, grab the text from the image, create a pdf file with that text in, preferably in the same layout as before. If the original image had images with the text, then the images should be preserved in the new document.
Why the poster didn't say so in such a clear manner is beyond me though!
Oh, and moderators, this is "Redundant", not "Informative".
Re:tiff2ps (Score:1)
Re:The age old question (Score:1)
You make copies.
I'm not insulted because I only work to make money. As long as I am paid well, treated with respect and left alone in my private life to enjoy myself as I will, I don't have any compunctions about making copies.
Meanwhile, I can work from the inside of a large corporation to fight for the right of consumers to make copies of things.
Maybe you don't know, but Kinko's ability to make copies for people was hampered by a lawsuit from textbook makers. Kinko's can't make copies of copyrighted things, and are expected to make every effort to prevent customers from doing the same. In spite of the fact that what they want to do might be fair use.
Because we are not legally permitted to make the distinction, we are not allowed to do anything that could possibly infringe.
That, and they give me plenty of vacation, holiday and sick time; schedule around my education; and pay for me to go to school.
Not so bad for just making copies.
Maybe Vividata (Score:2)
At the time the only OCR software that I could find on Linux was from a company called Vividata. At that time they were just adding Linux support and it didn't seem to work for shit, but the support was pretty new.
I use shell scripts to drive SANE programs to do the scanning and conversion to PDF using convert (ImageMagick) and then ps2pdf (ghostscript). If the Vividata product actually works now, it might be nice to scan, then OCR, then convert to PDF. A quick index by ht://Dig will then make a nice searchable archive of scanned docs.
The Vividata products however are not free, if this is a consideration.
--Aaron Newsome
Bad Link (Score:2)
----
Easy! (Score:2)
For some reason it comes on a CD-R with a xeroxed insert. I can't imagine why Adobe would let their packaging standards slip so badly...
Nick
Re:PDF, Ugh. (Score:2)
Re:Interns (Score:2)
I've been thinking about this for a while... can't you just scan and OCR it once, nudge the paper on the scanner, scan and OCR it again, and then use a script to compare the two files? You may use more than two scannings if accuracy is that important.
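The "compare the two files" script could be as simple as the sketch below, which uses Python's standard difflib to list the places where two OCR passes of the same page disagree. The function name is made up, and word-level comparison is just one reasonable choice.

```python
import difflib


def disagreements(pass_a, pass_b):
    """Return (text_from_a, text_from_b) pairs where two OCR passes differ."""
    a, b = pass_a.split(), pass_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Only the disagreeing spots need a human look. Errors that both passes make identically will not show up, so this narrows proofreading rather than replacing it.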
Something that's been common in the "warez" ebook scene is that people will often correct mistakes in the book as they're reading it, and then spread the corrected version. After a period of time, the book becomes more and more solid.
--
Re:Adobe Acrobat 4.0 (Score:2)
That's why you need the verify stage (Score:2)
So hire two sets of interns or high school kids. Compare the two. Pretty easy. Twice as expensive to get the data in, but it would be more accurate.
Doesn't solve the problem of unreadable original documents which are misread both times, but that's a different story.
--
Re:PDF, Ugh. (Score:2)
Not to mention I tend to prefer free (Beer,speech) software for anything I do and anything I pass along to clients.
Luckily a bit of work with google and I found some guy in england who had written his own PDF libraries (not nearly as nice as PDFlib linked above) which were GPL'd and had enough functionality to do what I needed.
Adobe Acrobat "Paper Capture" can do this (Score:2)
this must be *UNIX problem I guess. (Score:2)
Does this help??
send flames > /dev/null
Re:where can one find ps2pdf ? (Score:2)
Re:Violating copyright (Score:2)
Yes, but their statement about you not being able to do that, is just plain wrong. Just because they say you can't, that doesn't mean you really can't. You didn't actually put your own signature under those words, did you?
If you didn't sign that page of the book, and you didn't get the book directly from the publisher under the terms of some weirdo contract (as opposed to buying it from a bookstore), then the only real restrictions are the ones stated under copyright law. Moving the book into a computer sounds pretty Fair Use -ish to me. Just don't violate the copyright.
---
Mod this up! (Score:2)
--
Compaq dropping MAILWorks?
Re:Is it legal to convert PostScript to PDF? (Score:2)
File formats can't be patented, they can only be trade secrets (I believe.) Otherwise don't you think Microsoft would have patented
--
Re:Missing a step? (Score:2)
Re:I see what I missed (Score:2)
The original article didn't mention which nice public OCR programs he found, so we don't know the capabilities of what he already found.
What he needs is an OCR program which can separate text from images and format the text and images in a similar way on a PS or PDF page. At that point PS or PDF to text programs can be used for indexing.
OT: Opensource OCR (Score:2)
What I'd like to do is enhance the intelligence of OCR, for things like forms. The three things that would be useful are these...
The ability to define rectangles and lines before OCR happens, so that it will interpret them as graphics as opposed to part of the text.
The ability to define columns and groups better, and what type of information the column has. For instance phone numbers, addresses, etc. (and thus quit translating 6 to b
A list of frequent mistranslation pairs - OCR tends to make consistent mistakes - if the spell checker were to substitute the alternative character pair for the mistranslation, I would receive a lot fewer misspellings.
I figure that those three options would increase the accuracy of the OCR software that I've been using by 95% easily. (The other five percent is from "Fax noise", photocopy fade, and handwritten notes...)
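The third idea, a list of frequent mistranslation pairs applied by the spell checker, can be sketched like this. The confusion list and dictionary are whatever the user supplies; the function name is hypothetical, and the sketch tries only one substitution per word.

```python
def correct_word(word, dictionary, confusions):
    """Try known OCR confusion pairs until the word is found in the dictionary.

    confusions is a list of (misread, intended) string pairs, e.g. ("rn", "m").
    """
    if word in dictionary:
        return word
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in dictionary:
                return candidate
            start = word.find(wrong, start + 1)
    return word  # no confident fix; leave it for a human
```

With a dictionary containing "modem" and the pair `("rn", "m")`, the misread "rnodem" comes back as "modem", while words already in the dictionary are left alone.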
LetterRip
OCR system (Score:2)
Total cost: more than I'm worth.
Value of having 8 million documents in a 2x2 cube: your guess is as good as anyone's.
Errata:
-Number of alternate solutions we looked at: 0.
-Number of comparisons between this and alternate solutions I could find: 0.
-Number of replies I got to a request for comparisons on IWETHEY: 0
-Number of seconds my
-Rank, among the reasons I'm looking for a new job: 2, right behind "Hey let's get Citrix Metaframe so our lame-ass accounting software can track 100 PCs at your location!"
Anyone need linux support in boston?
-jpowers
A possible solution? (Score:2)
Steps for conversion:
1. For pages with images, draw a colored border around each image on each page. Make the color something that will sharply stand out (like bright green).
2. Tricky part - process each tiff image (in a looped script) doing the following:
a. Scan each page to color tiff, with sequential filenames (001.tiff, 002.tiff).
b. Using a custom written utility, build two new tiff images - a tiff of the page without the color-bordered images, and a tiff of the color-bordered image(s) on the page. Number the page images like (p001.tiff, p002.tiff), and the images for each page (p001i001.tiff, p001i002.tiff), so that it is known which images go with what page.
c. Convert each page image to postscript, then to html (unless there is a tiff2html tool out there?) - preserve the filenames (p001.html, p002.html), modifying only the extension.
d. Convert each image for each page to a (gif, jpeg, png), preserving the filenames (p001i001.png, p001i002.png), with a new extension.
e. Add IMG tags for the images to the end (or beginning) of the html pages, for each page.
3. After batch conversion, go back and proofread/reformat pages (to position images where they should go, etc).
Everything to do this should exist in some form already - except for maybe step 2b - that might be a completely custom tool that needs to be written, but it shouldn't be very hard to code (loop through bytes of image, looking for the sharp color changes - kinda like edge detection code - saving/masking the areas in the outlines)...
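A rough sketch of the step 2b idea: loop over the pixels, find those close to the border colour, and cut out the enclosed region. To stay self-contained it works on a plain list of rows of (r, g, b) tuples instead of a real TIFF, and it assumes a single bordered image per page; the function names and the tolerance value are made up for illustration.

```python
def border_box(pixels, border=(0, 255, 0), tolerance=40):
    """Return (left, top, right, bottom) of pixels near the border colour.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    Returns None when no pixel is close enough to the border colour.
    """
    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, rgb in enumerate(row):
            if all(abs(c - b) <= tolerance for c, b in zip(rgb, border)):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def crop(pixels, box):
    """Cut the boxed region (border included) out of the page."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]
```

Handling several images on one page would need the marked pixels grouped into separate connected regions first, which this sketch does not attempt.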
Re:A possible solution? (Score:2)
Of course, if such a program existed - tiff -> OCR'd postscript (searchable text), then my solution would work (I am not advocating the manual cutting and pasting of images - a piece of code would have to be written to do that) to convert the stuff to html.
Of course, if one went ahead and built an OCR engine (converting tiff to PS), then they could go all the way and add the extra image stuff in and save all the steps I added...
And here I was thinking I was being smart...
You haven't got the right Xerox printer (Score:2)
George
Re:That's why you need the verify stage (Score:2)
If you had the money, you could hire enough sets of high school kids to get a high-school-kid-RAID going; that way, you could hot swap the sick ones and not lose any productivity.
George
Best option: TextBridge Pro 8.0 (Score:2)
I was going to use Acrobat Capture, until Adobe ("The Microsoft of the Graphics World") started charging a penny and a half per page. Suddenly, the job went from costing $800 (old Capture pricing) to $25000 (new Capture pricing). I even called the Product Manager at Adobe for Capture and asked her why they made such a bold, stupid move. She said that Capture was now a "server product", which justified the price increase. I asked her if she expected anyone to use Capture rather than the $80 TextBridge Pro which did the same thing, and she said yes. "You're on the wrong drugs," I said.
To make TextBridge even sweeter, it turned out to be scriptable. I could hand TextBridge specialized configuration files for each job. This allowed me to use Perl to automate the conversion of several tens of thousands of TIFF images into multipage, searchable PDFs. Yay, TextBridge!
Apparently, though, Adobe had some words with Xerox (ScanSoft), because Version 9 does not include PDF support. Wankers.
If you can find a copy of Textbridge Pro 8.0 (I think it's the "'97" release), it'll do the trick!
Re: Proof reading (Score:2)
Software like Omni Form will let you designate areas on the page to ignore. This should retain picture elements and will put OCRd text in a layout that resembles the original. This, of course, most likely requires user input, at least for each different page layout.
Re:Adobe Acrobat 4.0 (Score:2)
Re:Adobe Acrobat 4.0 (Score:2)
Re:Missing a step? (Score:2)
Re:Missing a step? (Score:2)
Primitive searchables.. (Score:2)
Entry level commercial products (read: $200, Windows) will export to a
OT: Kind of, but..
Something I would like to see is an OCR search-on-demand application. In most document management systems you use only image files, and the information is only searchable by metadata.
Re:the OCR situation is not good (Score:2)
Textbridge (on the Mac) has a "verify" function that allows for interactivity. As it is OCR'ing, it seems to run each word through a dictionary, and if it's not found, then it asks you to verify what it should be. This process makes it only a little bit faster than raw typing.
Microsoft would have it otherwise... (Score:2)
Violating copyright (Score:2)
Most books have something along the following lines printed at the front:
All rights reserved. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means - graphics, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without the written permission of the publisher.
Oops. I hope that doesn't apply to the copyright notice I just pirated from my copy of SNMP versions 1&2, Theory and Practice! - antoine
The OCR situation is better than you think (Score:2)
Interns (Score:2)
People are doing it... (Score:2)
After you solve the paper to PDF problem... (Score:2)
...could you solve my PDF to HTML problem? I haven't seen any cheap converters for that either. I wouldn't hate PDF so much if I could convert it. I understand that dead tree documents have their place, but that shouldn't come at the expense of on-line documents. Until someone comes up with a free PDF to HTML converter, I will continue to complain to companies and government agencies that post documentation in PDF.
The regular
Review. (Score:2)
Well, as advertised, it *does* convert PDF to HTML in a way that would work very well for text-to-speech software.
It strips *all* formatting, including many br tags. It's really not much better than a plain text converter.
So, if you're visually impaired and need to read a PDF, this is fine, but it falls far short of what I want: A true free PDF to HTML converter that does its best to preserve the look of the original document.
The regular
Re:OK MODERATORS (Score:2)
At least check the link before you flame others about marking something as offtopic (*HINT* it points to http://www.microsoft.com and NO SUCH HOWTO exists.)  Duh. :-)
Re: Proof reading (Score:2)
Perfect OCR isn't necessary for searching documents. As long as the OCR is pretty good, you can get pretty good searches. Since the question stated that they want to look at the diagrams, the original image obviously needs to be saved.
One could make the text hidden as suggested by post #27. [slashdot.org]
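To illustrate why imperfect OCR can still support search, here is a small sketch that matches a query against OCR'd words approximately, using difflib from the Python standard library. The function name, the data layout (a dict of page id to OCR text) and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib


def fuzzy_page_search(pages, query, cutoff=0.8):
    """Return ids of pages containing a word close to the query.

    pages maps a page id to that page's OCR text.
    """
    hits = []
    for page_id, text in pages.items():
        words = [w.strip(".,;:!?").lower() for w in text.split()]
        if difflib.get_close_matches(query.lower(), words, n=1, cutoff=cutoff):
            hits.append(page_id)
    return hits
```

A page whose OCR reads "c0nfiguration" is still found by a search for "configuration"; the saved original image then shows what the page really says.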
Re:Bad Link (Score:2)
Re:Adobe Acrobat 4.0 (Score:2)
Looks like it:
From: <Saved by Microsoft Internet Explorer 5>
Subject: Ask Slashdot: From Paper To PDF?
Date: Sun, 18 Jun 2000 10:02:56 -0700
MIME-Version: 1.0
Content-Type: multipart/related;
boundary="----=_NextPart_000_0000_01BFD90C.601E59E0";
type="text/html"
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4029.2901
This is a multi-part message in MIME format.
------=_NextPart_000_0000_01BFD90C.601E59E0
Content-Type: text/html;
charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://slashdot.org/comments.pl?sid=00/06/05/2353219&cid=171
Et cetera. It's even saving it as if it were an mbox entry... Don't get much more open than that... MIME, HTML, BASE64.
Re:The age old question (Score:2)
Bingo. If you have only a one man staff for over 100 people that is... In that case:
Windows 9x: $250 gets you an OS in a box. Hope you like it. Supporting it costs very little because you can do very little with it. Like "A meal in a can" its server capabilities are laudable only as an example not to follow -- don't pack so much crap into something that is already bursting at the seams.
Windows NT/2k: $$$$$ gets you an OS in a paper sleeve. It doesn't matter whether you like it or not because once the managers see it you are stuck with it. Supporting it costs very much because you can't do anything with it properly. Takes about 1 server for like 10 clients. Sorta like duct tape when it is used on anything but ducts.
Linux: No money gets you an OS on an FTP site. For one man, supporting that many users is going to cost extreme $$$$$$. But you can do it all on one machine. Just like a big swiss army knife.
Of course, a smart company (too bad these don't exist) would hire 5 people (one per 20), run Linux, and buy X-Terms. This is cheaper than ANY of the Windows solutions I have ever seen...
Just my $0.02
Re:Effective Solution (Score:2)
OCR can retain formatting (Score:2)
Omni Page [caere.com] has excellent capabilities for OCR that will scan and retain most, if not all formatting. It also supports this with WordPerfect, not just the Redmond brand X software that you see around.
Unfortunately, it still requires a win9+ machine, but otherwise it falls into the category of Really Good Stuff(tm)
They were separate from TextBridge a while back, but the companies merged during the past couple of years.
The other option is to see if the companies have copies of the books available on CDs, etc. This depends on the company, of course.
The Holy Grail (Score:2)
One in particular that comes to mind is an auto insurance place. All of those customers who have to process stuff yearly, etc. Never mind the usual database issues...
If you figure it out, you have the makings of a great business plan.
Re:Is it legal to convert PostScript to PDF? (Score:2)
PDF, Ugh. (Score:3)
I ended up searching for three days (and submitting an ask
Moral of story: test the technology before selling to a client. And trying to generate PDF's on the cheap is only for those who have way more time than money!
Re:Is it legal to convert PostScript to PDF? (Score:3)
Torrey Hoffman (Azog)
PDF XML (Score:3)
It'll be out in the next week or so; check Freshmeat.
The idea behind it is, create a nice layout template in the tool of your choice -- Illustrator, for example. Save as PDF. Convert to XML. Add your markup to it -- extra text, etc., convert back to PDF. Done!
Release 1.5 will include a "template" feature, whereby you can use pages from existing PDFs as templates directly; something along these lines (pseudocode):
p = new pdf();
t = new pdftemplate("foo.pdf");
p.newpage("8.5","11");
p.include_from_template(t.page(1));
p.drawstring("Hi!");
p.write("bar.pdf");
Does this type of tool sound interesting to anyone?
On a related note, we plan to offer it as both open source and a commercial product. For instance, the ActiveX interface would be commercial. You could negotiate a commercial license. And you can use it under something like the Aladdin license (a la ghostscript, pdflib, etc). Any advice on open source + commercial? I have to justify my department's budget.
Other ways... (Score:3)
Some useful sites:
PDF Research [pdfresearch.com]
Planet PDF [planetpdf.com]
AcroBuddies [acrobuddies.com]
Codecuts [codecuts.com]
PDF Zone [pdfzone.com]
Adobe [adobe.com]
Deja.com [deja.com]
I did that in two hours... (Score:3)
1) Write LaTeX resume style class. Mine's pretty primitive because it only has to deal with my resume.
2) Create resume using resume style.
3) pdflatex resume.tex.
Or...
3) latex2html resume.tex (Though latex2html doesn't really generate it to look the way I need it, it is just a simple perl program so you could always hack it.)
Nice thing about LaTeX is you can also go to XML or DVI or RTF or a number of other fairly widely used formats. Or you could just ship the raw LaTeX if the company you're dealing with is that clueful.
the OCR situation is not good (Score:4)
Last year, I tried several Linux-based OCR packages, and they basically didn't work at all.
I ended up using the Windows software that came with my scanner to OCR the documents, and at first glance it appeared to do a good job -- it didn't mess up too often. But then I went in and actually proofread and spell-checked its output to find all the typos it had made, and it turns out that this process was so time-consuming that it was faster for me to just type it all in by hand. Even though the OCR software only made a mistake every few lines, finding those mistakes took enough concentration that typing the whole thing took less time.
Your mileage may vary, according to how fast you can type.
embed TIFF images in the PDF (Score:4)
So, a simple conversion would consist of just putting the scanned TIFF images in sequence into a PDF file.
Re:why bother with PDF? (Score:4)
With PDF, you can design and lay out your ad and transmit it electronically (or on disk) to the newspaper, knowing that it will print exactly how it did for you. Or you can lay out your brochure and send it off to the printers knowing the same thing. With any other format, the publisher/printer's machine is going to have at least one (oh, if only it were ever just one!) setting different than yours, which will change the layout.
PDF is the way that print ads are submitted electronically today. It's either PDF or old-fashioned cut-and-paste (no, even more old-fashioned than you're thinking, I mean with actual scissors and glue). The Associated Press runs a "wire service" called AdSend for ad agencies to transmit PDF ads electronically to newspapers and magazines -- and they are transmitting millions of PDF's a year.
The same thing basically goes for sending anything you want printed to a print shop. In any case, free PDF-making software enables dead-tree publising the same way that the web enables electronic publishing (though we haven't got any print shops that'll work for free, yet :-)
========
Missing a step? (Score:4)
Scan to OCR to PS to PDF
There are apparently a couple of tools to do this for you. Check out a brief list here [umd.edu]
Seeing as you've looked into Adobe Capture, Windows may be an option. If so, then the other question would be whether you've looked into Textbridge [scansoft.com]? This looks like it would do exactly what you're asking. No muss, little fuss.
The age old question (Score:4)
The short answer is using OCR to create a text file, proof reading the text file, and then printing to a postscript file.
The long answer is, you need to find quality OCR software that does not choke on things like forms. You also *MUST* proofread every OCR'd document. No OCR is perfect, and drawn elements will almost certainly trip the software into embedding odd characters or pipes into your text. Different font sizes will cause the software to choke. Thin fonts will cause the software to choke.
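One way to make that proofreading pass a little less painful is to have a script point at the lines most likely to be damaged. The sketch below is only an illustration: the character set and the "digit inside a word" rule are guesses at what stray pipes and misread glyphs look like, not something any particular OCR package documents. It narrows down where to look first and does not replace reading the whole text.

```python
import re

# Characters that rarely show up in ordinary prose (a hypothetical starting set;
# tune it for your documents, e.g. drop "_" for programming books).
SUSPICIOUS = re.compile(r"[|~^`\\{}<>_]")
# Digits sandwiched inside a word, such as "c0mputer" or "l1ne".
MIXED = re.compile(r"[A-Za-z]\d+[A-Za-z]")


def flag_lines(text):
    """Return (line_number, line) pairs that deserve a closer look."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SUSPICIOUS.search(line) or MIXED.search(line):
            flagged.append((number, line))
    return flagged
```

Clean-looking lines can still contain real-word errors ("form" for "from"), which is why the full read-through is still needed.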
If you are OCRing forms, I recommend Omni Form (it's the only software I know of that recognizes forms, but I have never used it personally).
Batch processing of OCR pages is likely easy to set up with professional OCR software (Omni Page does it), but it does not excuse you from proofreading the results. After that, the PDF part is a snap, and can be accomplished with any OCR software you choose to use.
If you are asking which OCR software is best, I can't help you directly. OCR software is a niche software market, and you either get free, disappointing software with your scanner, or you pay big money for something that does a decent job. Just like everything else in life. Have you read any OCR software reviews?
A former intern... (Score:4)
So, I ended up being the cheap labor to get the stuff together, but I incorporated the error checking suggested by the other replies, and I utilized OCR to minimize carpal tunnel damage.
Yeah, it took a while, and yes I got paid little in comparison to the other people at the location, but I got paid, they got their silly meeting minutes online, and they didn't have to hire 1,000 monkeys with 1,000 type-writers and have redundancy of people or invest in vast warehouses of paper feeders.
The scale of my work: I worked on a series of bound volumes that took up 3+ feet on a bookshelf and I completed the work on my own in less than 2 weeks (while also fielding tech support questions from the group). If you have 1,000,000 pages to be put online yesterday, maybe you could use a larger staff - but always remember:
If it takes a farmer 3 days to plow a field, and 3 farmers only a day to plow the same field, and it takes one woman 9 months to have a baby, how many months does it take 9 women to have one baby?
Often putting more people on a project doesn't equate to faster solutions or better ones and usually not cheaper ones.
Adobe Acrobat 4.0 (Score:5)
In addition, you can also buy the Adobe Acrobat Business Tools, which is a slightly broken but still functional version of Acrobat 4.0. That is available here: http://www.adobe.com/store/products/acrbustools.html [adobe.com].
Save money on OCR by sacrificing quality (Score:5)
I just used an off-the-shelf OCR engine and hacked the text together with the images programmatically myself. We would get TIFF images, which most engines could understand.
On really, really big OCR jobs, though, the real problem is the tradeoff between human intervention and quality. See, OCR engines just guess at stuff. The only reason they work at all is that they guess well. But they guess wrong anywhere from 0.1% to 10% of the time, depending on the quality of the input.
Each mistake must be corrected by a human being. But humans are expensive. If you have lots of documents to OCR, the technology integration costs and the cost of the OCR engines themselves are amortized. They end up dwarfed by the paychecks of the humans.
The cost of massive amounts of OCR, therefore, is directly related to the amount of human correction of OCR mistakes.
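To put rough numbers on that, here is a back-of-the-envelope sketch. Only the 0.1% to 10% error range comes from the figures above; the characters per page, seconds per fix, and hourly wage are placeholder assumptions you would replace with your own.

```python
def correction_cost(pages, chars_per_page, error_rate, seconds_per_fix, hourly_wage):
    """Estimate (expected_errors, labor_cost) for hand-correcting OCR output.

    error_rate is the fraction of characters the engine guesses wrong,
    e.g. 0.001 for 0.1%.
    """
    errors = pages * chars_per_page * error_rate
    hours = errors * seconds_per_fix / 3600
    return errors, hours * hourly_wage


# Compare the best and worst error rates on the same hypothetical job.
for rate in (0.001, 0.10):
    errors, cost = correction_cost(
        pages=10_000,        # assumed job size
        chars_per_page=2_000,  # assumed page density
        error_rate=rate,
        seconds_per_fix=10,  # assumed time to spot and fix one error
        hourly_wage=20.0,    # assumed operator wage
    )
    print(f"error rate {rate:.1%}: about {errors:,.0f} fixes, about ${cost:,.0f}")
```

Because the cost scales linearly with the error rate, the same job is a hundred times more expensive at the bad end of the range, which is why input quality and the level of correction you insist on matter more than the price of the engine.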
Thus, you can save tons of money by selectively sacrificing OCR quality. Getting every page perfectly formatted requires around 60 seconds a page for a skilled OCR operator. It's all about reducing that time. How? Simple. Don't expect everything to be perfect. There are various levels of quality you can get out of OCR engines-human systems:
Oh, and it really helps if you get the workflow of the OCR down. Allow the operator to move on to the next document automatically, save them the trouble of remembering the name of the document they're working on, etc. etc. This may require a bit of hacking of the OCR engine you're using, but it's worth it.
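A minimal sketch of that kind of workflow helper, assuming the OCR output lands as `.txt` files in one directory and that a plain log file records which ones have been reviewed (both the layout and the function names are hypothetical, and this wraps around the OCR engine rather than hacking it):

```python
from pathlib import Path


def next_document(inbox, done_log):
    """Return the first unreviewed .txt file in inbox (by name), or None when finished."""
    log = Path(done_log)
    done = set(log.read_text().splitlines()) if log.exists() else set()
    for path in sorted(Path(inbox).glob("*.txt")):
        if path.name not in done:
            return path
    return None


def mark_done(path, done_log):
    """Record a document as reviewed so it is never offered again."""
    with open(done_log, "a") as log:
        log.write(Path(path).name + "\n")
```

The operator's loop is then just: call `next_document`, open whatever comes back in the editor, call `mark_done`, repeat. Nobody has to remember a file name or where they left off after a break.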
So when doing something like this, ask yourself: how perfect does it have to be, really? You can save tons of money if you can cut any quality corners.