Show HN: HTML visualization of a PDF file's internal structure

450 points, posted 11 days ago
by desgeeko

59 Comments

codetrotter

11 days ago

Many moons ago I was tasked with extracting data from a bunch of PDFs. I made a tool to visualise how characters were laid out on the page and bounding boxes of all the elements.

The project was in the end a complete failure and several people were upset at me for not delivering what I was supposed to.

In present day, with the capabilities that are now available with LLMs to extract data from PDFs I 100% would go the route of utilising AI to extract the data they wanted. Back then that did not yet exist.

bob1029

11 days ago

Parsing data out of arbitrary PDFs is a cursed mission. PDF can contain images, so you might as well target JPEG directly.

OCR can take you pretty far depending on expectations, but it's never quite far enough in my experience.

themanmaran

11 days ago

That's been our experience as well. Just scrapping any of the metadata associated with the PDF and treating it like an image. Since you never know when a document has a screenshot of an excel table inside.

The .NORM files (https://xkcd.com/2116)

jimjimjim

11 days ago

LLMs might help with sequencing the characters you extract from the page, but actually getting the contents is still difficult. A number of times I've come across a page where the letters of the text are glyphs in a custom font with no mapping to ASCII or anything similar. Even more common, especially with output from CAD, are letters made by drawing lines in the shape of letters, so there is nothing identifiable to extract and you are left with OCRing the page to double-check the results.

macklinkachorn

11 days ago

In my previous role, I experienced similar things: the rule-based parsing approach is really tricky to get right and often failed on edge cases.

We (at https://runtrellis.com/) have been building a PDF processing pipeline from the ground up with LLMs and VLMs and have seen close to 100% accuracy even for tricky PDFs. The key is to use a rule-based engine and references to cross-check the data.

spacecadet

10 days ago

Many moons ago worked on extracting 2D CAD drawings from PDFs and converting to full 3D. Fun times.

gsempe

6 days ago

I’m very interested in whether you managed to make it work.

spacecadet

17 hours ago

Yes and no. Had an initial prototype but had issues with all of the edge cases related to document formatting and detail, and eventually abandoned it.

rad_gruchalski

10 days ago

pdfjs does all of that and it’s pretty solid. I used it recently to extract tabular data out of a 10-year batch of bank statements.

aboardRat4

10 days ago

mathpix does quite an awesome job actually

Muromec

11 days ago

That's pretty cool! I would have used it a lot at my previous job if it existed back then. In my ideal world it should work somewhat like https://lapo.it/asn1js/ -- you drop a file and it does all the stuff locally.

swsieber

11 days ago

I've used the iText RUPS (free) for a while for debugging PDFs (as I have the "privilege" to work on code that extracts data from PDFs...). It looks like your introspection stuff might be a bit stronger, which would be great. I'll take it for a whirl.

est

11 days ago

I remember there was a similar project on GitHub that lets you visualize any type of binary data from a given schema. There was a TCP/IP example, IIRC.

ddulaney

11 days ago

https://kaitai.io/ maybe?

It looks perfectly nice for its role, but I didn’t use it for my last project because I need serialization as well.

elliottcarlson

11 days ago

Kaitai is very useful for reverse engineering a binary file that you have some assumptions about. I've used it for save file reverse engineering and then creating a read/write library for it. It should be usable for PDF metadata.

mdaniel

11 days ago

Be careful, "any" is a strong word in this context. Interestingly enough, I actually use PDF as the "hello world" for kicking the tires on any such file format descriptor I find because PDF is such a crazypants specification. Thus, if the descriptor language is able to accurately capture the layout of a PDF, it's obviously well thought out.

I haven't had a lot of luck thus far, except ones which allow escaping out of declarative mode over into "and then run this code"

SSLy

11 days ago

Damn, this is also convenient for forensics and finding watermarks.

pr353n747-0n83

11 days ago

That does sound interesting. Forgive my ignorance, but how could this be used to detect watermarks? Could the same method be used to detect signatures?

edoceo

11 days ago

This tool is pulling out all the metadata in the document. Lots of goodies in there not typically displayed.

tyilo

11 days ago

Looks nice.

Would be better if all of the PDF's bytes were shown. Seems like `endobj` and `xref` are not shown.

desgeeko

11 days ago

Thanks for noticing! You're right, I will fix that very soon.

tyilo

11 days ago

When opening the following hello world PDF, the trailer isn't shown correctly and both `startxref` and `%%EOF` are missing: https://ghostbin.site/bb7jb
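
For readers unfamiliar with the keywords mentioned here: a classic PDF file ends with an `xref` table, a `trailer` dictionary, a `startxref` line holding the byte offset of the xref section, and the `%%EOF` marker. Below is a minimal sketch of locating them in raw bytes. It is not how pdfsyntax does it, and `DATA` is a made-up fragment rather than a complete valid PDF.

```python
import re

# Hypothetical fragment: a 9-byte header followed directly by the file tail.
# Real files have objects between the header and the xref section.
DATA = (
    b"%PDF-1.4\n"                      # bytes 0..8
    b"xref\n0 1\n0000000000 65535 f \n"  # xref section starts at byte 9
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n9\n%%EOF\n"
)

def read_startxref(data: bytes) -> int:
    """Return the offset announced by the last startxref/%%EOF pair."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found")
    return int(matches[-1])

offset = read_startxref(DATA)
# The announced offset should point at the `xref` keyword itself.
points_at_xref = DATA[offset:offset + 4] == b"xref"
```

A viewer that shows every byte would need to account for all four pieces, since a reader typically starts from the end of the file and follows `startxref` backwards.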

tekkk

11 days ago

This would be really nice as a browser library. Could just drag and drop a file and see its insides. But impressive nonetheless.

kohbo

11 days ago

Do you mean a browser extension? Not trying to be rude; just making sure I understand.

brailsafe

11 days ago

Since it's a python library, my guess is that they meant a JavaScript or WebAssembly package would be useful.

tekkk

10 days ago

I meant JS or WASM

nonrandomstring

11 days ago

Well done. This is a very useful security previewing tool. PDFs are a menace.

kevmo314

11 days ago

Is the UI tooling that does the visualization a library? I really like the UI format, would love to use this for breaking down and debugging video byte streams too.

EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_...

desgeeko

11 days ago

Yes, I value simplicity and the interactivity offered by basic HTML and CSS is sufficient for my use case :)

nabaraz

11 days ago

On a similar note, why hasn't PDF been replaced? There are XPS, DjVu and XHTML (EPUB), but they all seem to be targeting a different use case (a packaged HTML file).

What I want is a simple document format that allows embedding other files and metadata without Adobe's bloat. I should be able to hyperlink within pages, change font size, etc. without text overflowing, and print in a consistent manner.

xp84

11 days ago

I don't think what makes PDF an 'unfortunate' format for (1) editing, (2) on-device reading, and (3) extraction of semantic information (as opposed to presentational information) is any sin on Adobe's part nor 'bloat.'

It's a page description format, not a data format, so all its decisions follow from the need to ensure that you and I can both print the same 'page' even if we use different operating systems, software, printers, exact paper dimensions, etc. I suspect the main reason it holds on so well is that so many things operate in a document paradigm, where 'document' means 'collection of sheets of paper.' Everything from the After-Visit Summary from the doctor, to your car registration document already has a specific visual representation chosen to allow them to fit sensibly and precisely on sheets of paper.

Could HTML (say, with data URLs for its images and CSS so that it can stand on its own), or ePub be a better format in most ways? Sort of, but it is optimized for such a different goal that if you went in to evangelize that switch to everyone who makes PDFs today, you'd be met with frustration that the content will look a bit different on every device, and that depending on settings, even the page breaks would fall differently.
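
To make the "HTML that can stand on its own" idea concrete, here is a minimal sketch of inlining an image as a `data:` URL so the page needs no external files. The function names and the page skeleton are illustrative assumptions, not part of any tool discussed in this thread.

```python
import base64
import html

def to_data_url(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URL."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def standalone_page(title: str, image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Build a single-file HTML page with its image embedded inline."""
    src = to_data_url(image_bytes, mime_type)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset=\"utf-8\"><title>{html.escape(title)}</title></head>\n"
        f"<body><img src=\"{src}\" alt=\"\"></body></html>\n"
    )
```

The result travels as one file, like a PDF, but how it is laid out and paginated is still up to whichever browser opens it.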

Relatedly, it's interesting to me that even Google Docs, which I suspect are printed or converted to PDF far less than half the time, defaults to the "paged" mode (see Page Setup) that shows document page borders and margins, instead of the far more useful "Pageless" mode which is more like a normal webpage that fits to window and scrolls one continuous surface endlessly.

jimjimjim

11 days ago

Different use cases.

"without text overflowing" brings with it a lot of detail. In pdf every letter/character/glyph of text can have an exact x,y position on the page (or off the page sometimes). This allows for precise positioning of content regardless of what else is going on. It is up to the application that writes the pdf to position things correctly and implement letter or word wrapping.
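
As a rough illustration of the writer doing the wrapping itself, here is a sketch that breaks text into lines and emits PDF text operators with an explicit position per line. It assumes Courier (where every glyph is 0.6 of the font size wide) and a font resource named `/F1`. Both are assumptions for the example, and real writers use per-glyph font metrics.

```python
def wrap_words(text, max_width, font_size):
    """Greedy word wrap using a fixed per-character width (Courier assumption)."""
    char_width = 0.6 * font_size
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # Always accept the first word on a line, even if it is too wide.
        if not current or len(candidate) * char_width <= max_width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

def text_operators(text, x, y, max_width, font_size=12, leading=14):
    """Emit a text object where each line gets an absolute x,y via Tm."""
    ops = ["BT", f"/F1 {font_size} Tf"]
    for i, line in enumerate(wrap_words(text, max_width, font_size)):
        # Backslashes and parentheses must be escaped inside PDF literal strings.
        escaped = line.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        ops.append(f"1 0 0 1 {x} {y - i * leading} Tm ({escaped}) Tj")
    ops.append("ET")
    return "\n".join(ops)
```

Once written, the positions are fixed: a reader that changes the font size has no line-breaking information left to reflow with.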

XPS was the closest to reimplementing PDF, but Microsoft didn't get enough buy-in from other parties, so it quietly died.

staplung

10 days ago

An interesting aspect of PDFs that I didn't know until quite recently is that they're a subset of PostScript and that in fact accounts for some of the heftiness. PostScript is a full-on programming language (albeit an unusual one) but PDFs are not (i.e. they're not Turing complete). They do not support control flow and what could be expressed as a simple loop in PS must be unrolled and stored as a series of simple declarations/expressions for a PDF.

The advantage is that PDFs don't need a full program interpreter to be rendered.
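
A small sketch of what that unrolling looks like in practice: the loop runs in the program that writes the file, and only its flattened result lands in the content stream. The function name and coordinates are made up for illustration, and the PostScript line in the comment is an approximate equivalent.

```python
def unrolled_rules(count, x0, x1, y_top, gap):
    """Emit one explicit move/line/stroke triple per horizontal rule.

    In PostScript the same drawing could stay a loop, roughly:
        0 1 2 { 20 mul 700 exch sub /y exch def
                72 y moveto 540 y lineto stroke } for
    A PDF content stream has no `for`, so every iteration is written out.
    """
    ops = []
    for i in range(count):
        y = y_top - i * gap
        ops.append(f"{x0} {y} m {x1} {y} l S")
    return "\n".join(ops)

stream = unrolled_rules(3, 72, 540, 700, 20)
```

Three rules cost three lines of operators, and three hundred would cost three hundred, which is where some of the size comes from.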

idislikelatex

11 days ago

Because as soon as this conversation starts, the LaTeX crowd shows up, and everyone who has something meaningful to add as a standard is blocked by that discussion.

sundarurfriend

10 days ago

I appreciate the dedication in making an account just to say this :D

stronglikedan

11 days ago

One reason is that none of those other formats are suitable for commercial printing as-is.

wetpaws

11 days ago

Because it works, and works well enough. Also, immutability is a feature, not a bug.

escapecharacter

11 days ago

I’ve been shopping for something that does a per-byte description of the content of visual media formats (jpeg, png, avi, mp4, etc). Anyone know of one?

freeone3000

11 days ago

This sounds like the format specification? What are you looking for that is not a document?

escapecharacter

11 days ago

I want to drop a specific image in, and have a reader that debugs this. Sometimes images don't follow specs exactly, or stretch them in fun ways, and sometimes this leads to inconsistent behaviour across platforms. Sometimes passing an image through a platform strips or reformats this data.

The current context for me is I'm exploring various non-steganography approaches to embed metadata in photos. In the past, I've built custom formats to embed streaming data side-by-side: https://github.com/dustinfreeman/kriffer

nathan_f77

10 days ago

This is really cool! I've spent the last few years debugging lots of PDFs while working on DocSpring, so I'm always looking for new tools to make this easier. Thanks for working on pdfsyntax!

acabajoe

11 days ago

Kudos for making this self-hosted. So very much appreciated!

adelpozo

11 days ago

It does not have any dependency on a PDF parsing library, correct? That's a cool way to learn the file format and be able to work around weird PDF files. But what was the motivation for not using a library to do the PDF parsing work? Is it the case that there is none available? Nice work!

desgeeko

11 days ago

Correct, PDFSyntax implements everything at the lowest level. You can ignore the HTML visualization and use it as an API to access PDF objects. Why? Because I started a very small tool as a week-end project and I got hooked reading the PDF Specification so it is becoming a general purpose PDF library for Python. I am not familiar with other libraries but I have the impression that mine implements things that are often overlooked in others, like incremental updates.
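
For context on incremental updates: a PDF can be modified by appending a new section (changed objects, a new xref, a new trailer, and another `%%EOF`) without touching the earlier bytes. The sketch below is a crude heuristic for spotting that, not PDFSyntax's API. It simply finds every `%%EOF` marker and would be fooled by the same bytes appearing inside a stream.

```python
def revision_ends(data: bytes) -> list:
    """Return the byte offset just past each %%EOF marker found in the file."""
    marker = b"%%EOF"
    ends = []
    pos = data.find(marker)
    while pos != -1:
        ends.append(pos + len(marker))
        pos = data.find(marker, pos + len(marker))
    return ends

# Hypothetical bytes: an original revision followed by one appended update.
ORIGINAL = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
UPDATE = b"1 0 obj\n<< /Edited true >>\nendobj\n%%EOF\n"

ends = revision_ends(ORIGINAL + UPDATE)
# Truncating at the first marker recovers the earlier revision's bytes.
first_revision = (ORIGINAL + UPDATE)[:ends[0]]
```

More than one marker suggests the file carries earlier revisions, which is also why this matters for the forensics use mentioned elsewhere in the thread.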

xeon06

11 days ago

Wow, I've been doing some PDF parsing at work and this is going to come in SO handy.

vendiddy

11 days ago

Was mentioned in this thread, but I can also endorse qpdf as being a great library.

It gives you a JSON representation of the PDF data structure. What's nice is that it doesn't hide the underlying format, but it takes care of a lot of the low-level edge cases for you.

disqard

10 days ago

This looks amazingly useful!

Thank You For Making And Sharing!

LegionMammal978

11 days ago

If you're interested in manipulating PDFs, I've found QPDF [0] to be a useful tool. Its "QDF mode" lays out the objects in a form where you can directly edit them, and it can automatically fix up the xref table afterwards. It can also convert to and from a JSON format that you can manipulate with your own scripts.

[0] https://github.com/qpdf/qpdf, https://qpdf.readthedocs.io/en/stable/
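
A minimal sketch of driving those two modes from Python, assuming the `qpdf` command-line tool is installed and on `PATH`. The wrapper function names are made up here. The `--qdf`, `--object-streams=disable`, and `--json` options come from the qpdf manual, but check them against your installed version.

```python
import shutil
import subprocess

def qdf_command(src: str, dst: str) -> list:
    """Command line for rewriting a PDF into hand-editable QDF form."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]

def json_command(src: str) -> list:
    """Command line for dumping the PDF's structure as JSON on stdout."""
    return ["qpdf", "--json", src]

def run_if_available(cmd: list):
    """Run the command only if the executable exists; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    # check=False because qpdf uses a non-zero exit code for mere warnings.
    return subprocess.run(cmd, capture_output=True, check=False)
```

After hand-editing a QDF file, the `fix-qdf` helper that ships with qpdf is the tool that repairs the offsets and xref table.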

zackmorris

11 days ago

Just so we have them, the top links I got for PDF to JSON:

https://qpdf.readthedocs.io/en/stable/json.html

https://www.jsonify.org

https://github.com/maximoguerrero/PDF-GPT4-JSON

PDF is such a curious format. It's not human-readable, it's not well-structured, it's not small. If it weren't for momentum and the political horse trading that Apple, Adobe and Microsoft were doing when the web went mainstream and freaked them out around 1995, I'm not sure that we'd be using it today. Postscript is better in countless ways, but since it's Turing-complete, it's not really ideal for storing static data, and to my knowledge was never extended to handle binary data well, like for embedded JPEGs. I remember trying to print a 10 MB ps file in the 1990s and it took maybe 20 minutes because the grayscale image was basically represented as a bunch of run-length encoded scan lines.

I would argue that frontend web development has reached a similar fate. It seems odd to use a programming language (imperative, no less) to design media that we used to describe declaratively. If I had enjoyed success in my programming career, I would work on a declarative representation of HTML/CSS/Javascript that can represent the intersection of all existing markup across all mainstream browsers. Sort of like a mix between Markdown and CSS flexbox like Xcode's auto layout, but universal. It frankly would probably look like HTML, but with sane defaults/builtins/inheritance, as well as a way to define and extend components from the beginning, similarly to how people try to use data attributes. For contrast, React and Vue come at this from the opposite direction. I'm talking about something more like htmx.

Then we could work with that format and transpile to HTML or even React Native and dump 90-99% of the boilerplate and build tooling that we use currently.
