
goon's Journal: complex HTML-->???-->PDF?

the problem

  • but the current Word doc (the catalogue) has tables and graphics and was 'built' with Word templates, so I have no idea how well it would all convert.

the site that got me interested in pdf was Stas Bekman's site, www.stason.org. He gave a talk to Melbourne.pm last year. In the course of his talk on mod_perl 2 he showed the notes from his site in html, with pdf downloads of the site.

So I tried to re-create this html->ps->pdf chain so that I too could have a printable version of a project I'm working on called Ratpile, built with perl+DBI+TT2. (Ratpile makes a directory that has *stuff* stored in it searchable by stuffing information about it into a relational database; data mining, some may call it.) The template I created is a *bare bones* html page sans images. This is the technique Stas is using with his docset.

the point I guess I'm trying to make is that so far I've used text only and no images. I've done a bit of research and this is what I've come up with...

  • graphics are supported in postscript (3?)
  • others better than I (ybiC) have hacked together html->PS->PDF code, which appears to handle images via html2ps but not html tables (Create PostScript and PDF versions of all HTML files in given directory)
  • one approach could be to use Matt Sergeant's PDFLib (load_image method), an OO wrapper around pdflib by www.pdflib.com, but I seem to remember it has restrictions for use under OSI (has to be open source, private use or research).
  • or use Alfred Reibenschuh's Text::PDF::API. Via an old page I also found PDF-API2-0, which has some image-handling capabilities (jpg, png)
  • logreport has an interesting set of observations about html->PDF generation, namely problems with html formatting and tables

building html->PDF with images and troublesome html tables

now given what we have found above I would suggest the following steps (unless anyone has a better idea):

  • extract the Word document to html
  • extract the table data, either from the Word document via OLE or from the html via HTML-TableExtract (I like the latter better)
  • remove the html tables from the html documents
  • reinsert the data as a simple table, using <pre> tags for layout and html tags for bolding and emphasis. Or find, by experimentation, some other method of representing tables as text in html
  • use PDF-API2 as the PDF renderer. This can all be done in code.
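The extract/remove/reinsert steps above can be sketched roughly as follows. This is only an illustration in Python using the standard library's html.parser, standing in for HTML-TableExtract rather than showing that module's API. The names TableRows and table_to_pre are made up for this sketch, and it assumes simple tables with no nesting, no colspan/rowspan and no bold or emphasis markup to carry over.

```python
import html
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Assumption: tables are not nested and do not use colspan/rowspan.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being filled, or None
        self._cell = None   # text fragments of the current cell, or None

    def _flush_cell(self):
        # Close the current cell (if any), collapsing runs of whitespace.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        # Close the current row (if any), including a still-open cell.
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()  # tolerate a missing </td>
            if self._row is not None:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()


def table_to_pre(source):
    """Return the table cells found in `source` as aligned text in <pre> tags."""
    parser = TableRows()
    parser.feed(source)
    parser.close()
    rows = parser.rows
    if not rows:
        return "<pre></pre>"
    ncols = max(len(r) for r in rows)
    rows = [r + [""] * (ncols - len(r)) for r in rows]  # pad short rows
    widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
    lines = []
    for r in rows:
        line = "  ".join(c.ljust(w) for c, w in zip(r, widths)).rstrip()
        lines.append(html.escape(line, quote=False))  # keep <, >, & legal in html
    return "<pre>\n" + "\n".join(lines) + "\n</pre>"
```

Feeding it a two-row table of Item/Price cells gives two lines of text with the columns padded to the same width, which a plain html->ps step should be able to cope with.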

the real problem may be rendering the tables generated from Word. Complicated layout in Word (re-rendered to html) will have to be translated to postscript syntax and then rendered to PDF. So the problem comes down to converting the html tables to pdf.

it is not rocket science to create a bit of code to extract the data from the table and re-create a table using PDF-API (and its child modules).
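Re-creating the table mostly means deciding where each cell's text goes on the page. A minimal sketch of that layout step, in Python and deliberately without any PDF library calls: layout_table is a made-up name, and the defaults assume a fixed-width font such as 10pt Courier (6 points per character) and a page whose origin is bottom-left, as in PDF. Handing the resulting coordinates to PDF-API2's text drawing is left out here.

```python
def layout_table(rows, x0=72.0, y_top=720.0, char_width=6.0,
                 row_height=14.0, gutter=12.0):
    """Return a list of (x, y, text) placements, in points, for each cell.

    Assumptions (all hypothetical defaults): fixed-width font where every
    character is `char_width` points wide, one line of text per cell, and
    a bottom-left page origin so y decreases as we move down the page.
    """
    ncols = max((len(r) for r in rows), default=0)

    # Column width = widest cell in that column, measured in points.
    widths = [
        max((len(r[i]) for r in rows if i < len(r)), default=0) * char_width
        for i in range(ncols)
    ]

    # Left edge of each column, separated by a fixed gutter.
    xs = []
    x = x0
    for w in widths:
        xs.append(x)
        x += w + gutter

    placements = []
    for n, row in enumerate(rows):
        y = y_top - n * row_height  # each row sits one row_height lower
        for i, text in enumerate(row):
            placements.append((xs[i], y, text))
    return placements
```

Each tuple is then one "put this string at (x, y)" call for whichever renderer is used; a proportional font would need real string-width measurements instead of the character count.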

but is there a shortcut?

of course you could forget all the above and take your chances with Michael Frankl's HTML-HTMLDOC and convert your html files directly to PDF :)

credits

damn I love cpan.


complex HTML-->???-->PDF?

  • Why are you creating the PDFs out of the HTML step? Why not turn the DOC file into both HTML and PDF (or PS, where the PDF conversion is trivial)?
    • Why are you creating the PDFs out of the HTML step?

      Good point. One-step conversion is desirable. If I could find a Perl module that supports doc->pdf I guess I would use it. Do you know of any?

      MS Office and Open Office have no such native support. I think the question came about through avoidance of proprietary converters.

      • MS Office does have Postscript output, at least in conjunction with the printer driver. I believe that by driving Office through Windows' Automation facilities (OLE) you can tell it to print without the standard print dialog. For OS X, I don't think that Word supports the Enhanced Print Apple Event that allows scripts to bypass an application's print dialog.

        I haven't looked at Star Office or Open Office's scripting support enough to see if you can print to a file using a postscript print driver without present
        • available for many platforms, gnu, src and binary - "...Antiword [demon.nl] is able to convert Word documents to plain text, to PostScript and to XML/DocBook..."

          it's just the tool I need. It sure beats the heck out of doc->html->ps->pdf [perlmonks.org].
          While the tool has no Perl bindings, I can just cron a perl/python script to convert a directory of doc files to pdf.
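          A rough sketch of what such a cron-able script might look like in Python. It assumes antiword and Ghostscript's ps2pdf are both on the PATH, uses antiword's -p option to get PostScript on stdout, and writes the PDF next to each .doc file; the function names and the a4 default are just choices made for this sketch.

```python
import subprocess
from pathlib import Path


def conversion_commands(doc_path, paper="a4"):
    """Return (ps_path, [antiword_cmd, ps2pdf_cmd]) for one Word document.

    antiword writes PostScript to stdout, so the caller has to redirect
    the first command's output into ps_path before running the second.
    """
    doc = Path(doc_path)
    ps = doc.with_suffix(".ps")
    pdf = doc.with_suffix(".pdf")
    to_ps = ["antiword", "-p", paper, str(doc)]
    to_pdf = ["ps2pdf", str(ps), str(pdf)]
    return ps, [to_ps, to_pdf]


def convert_directory(directory, paper="a4"):
    """Convert every *.doc file in `directory` to PDF via PostScript."""
    for doc in sorted(Path(directory).glob("*.doc")):
        ps, (to_ps, to_pdf) = conversion_commands(doc, paper)
        with open(ps, "wb") as out:
            # check=True raises if antiword cannot read the document
            subprocess.run(to_ps, stdout=out, check=True)
        subprocess.run(to_pdf, check=True)
        ps.unlink()  # drop the intermediate PostScript file


if __name__ == "__main__":
    import sys
    convert_directory(sys.argv[1] if len(sys.argv) > 1 else ".")
```

          Run from cron with the directory as its only argument. It makes no attempt to skip documents that were already converted.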
