codetrotter
11 days ago
Many moons ago I was tasked with extracting data from a bunch of PDFs. I made a tool to visualise how characters were laid out on the page and bounding boxes of all the elements.
The project was in the end a complete failure and several people were upset at me for not delivering what I was supposed to.
Today, with what LLMs can now do to extract data from PDFs, I would 100% go the route of utilising AI to extract the data they wanted. Back then that did not yet exist.
bob1029
11 days ago
Parsing data out of arbitrary PDFs is a cursed mission. PDFs can contain images, so you might as well target JPEG directly.
OCR can take you pretty far depending on expectations, but it's never quite far enough in my experience.
themanmaran
11 days ago
That's been our experience as well. We just scrap any of the metadata associated with the PDF and treat it like an image, since you never know when a document has a screenshot of an Excel table inside.
The .NORM files (https://xkcd.com/2116)
jimjimjim
11 days ago
The LLMs might help with sequencing the characters you extract from the page, but actually getting the contents is still difficult. A number of times I've come across a page where the letters of the text are glyphs in a custom font with no mapping to ASCII or anything similar. Even more common, especially with output from CAD, are letters that are made by drawing lines in the shape of letters, so there is nothing identifiable to extract and you are left with OCRing the page to double-check the results.
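To make the "no usable mapping" case concrete, here is a minimal sketch of a heuristic that flags extracted text as unusable so the page can be sent to OCR instead. The function names, the set of Unicode categories treated as "unmapped", and the 0.3 threshold are all assumptions for illustration, not something taken from a particular tool. It also cannot catch text drawn as line art, since that yields no characters at all (which is why empty text is treated as unusable).

```python
import unicodedata


def unmapped_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like unmapped glyphs.

    Assumption: private-use (Co), unassigned (Cn) and control (Cc) code
    points, plus U+FFFD, indicate a font without a usable text mapping.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # no text at all, e.g. letters drawn as vector lines
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn", "Cc")
    )
    return bad / len(chars)


def needs_ocr(text: str, threshold: float = 0.3) -> bool:
    """True when enough of the extracted text is unusable (hypothetical threshold)."""
    return unmapped_ratio(text) >= threshold
```

A page whose extracted text passes this check can still be wrong (a custom font can map glyphs to the wrong letters), so this only decides when to skip straight to OCR.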
macklinkachorn
11 days ago
In my previous role, I experienced similar things: the rule-based parsing approach is really tricky to get right and often failed on edge cases.
We (at https://runtrellis.com/) have been building a PDF processing pipeline from the ground up with LLMs and VLMs and have seen close to 100% accuracy even for tricky PDFs. The key is to use a rule-based engine and references to cross-check the data.
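As an illustration of what a rule-based cross-check on model output could look like, here is a minimal sketch. It is not a description of the pipeline mentioned above: the `cross_check_invoice` function, the `line_items`/`amount`/`total` field names, and the one-cent tolerance are all hypothetical.

```python
from decimal import Decimal


def cross_check_invoice(extracted: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of rule violations for a model-extracted invoice.

    Hypothetical schema: {"line_items": [{"amount": ...}, ...], "total": ...}.
    The single rule here: line item amounts must add up to the stated total.
    """
    problems = []
    items = extracted.get("line_items", [])
    total = extracted.get("total")
    if total is None:
        problems.append("missing total")
    else:
        # Go through str() so floats such as 0.1 do not pick up binary noise.
        items_sum = sum((Decimal(str(i["amount"])) for i in items), Decimal("0"))
        if abs(items_sum - Decimal(str(total))) > tolerance:
            problems.append(f"line items sum to {items_sum}, but total is {total}")
    return problems
```

An empty list means the extraction is internally consistent under this one rule. Anything else would be routed to a retry or a human review rather than trusted.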
spacecadet
10 days ago
Many moons ago worked on extracting 2D CAD drawings from PDFs and converting to full 3D. Fun times.
gsempe
6 days ago
I’m very interested in whether you managed to make it work?
spacecadet
17 hours ago
Yes and no. Had an initial prototype, but had issues with all of the edge cases related to document formatting and detail, and eventually abandoned it.
rad_gruchalski
10 days ago
pdfjs does all of that and it’s pretty solid. I used it recently to extract tabular data out of a 10-year batch of bank statements.
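For anyone wondering what the tabular step can look like once you have positioned text, here is a minimal sketch in Python (not the pdfjs code I used). It assumes each text fragment arrives as an `(x, y, text)` tuple in PDF coordinates, where larger `y` is higher on the page. The `group_into_rows` name and the 2.0 point tolerance are made up for the example.

```python
def group_into_rows(items, y_tolerance=2.0):
    """Group positioned text fragments into table rows.

    items: iterable of (x, y, text) tuples in PDF coordinates (assumed).
    Returns a list of rows, top to bottom, each a list of cell texts
    ordered left to right.
    """
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    # Walk fragments from the top of the page downwards.
    for x, y, text in sorted(items, key=lambda it: -it[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # same baseline, same row
        else:
            rows.append([y, [(x, text)]])  # start a new row
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]
```

This only recovers rows. Splitting rows into consistent columns (for example by clustering `x` positions across the whole page) is a separate step, and multi-line cells will break the single-baseline assumption.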
GaggiX
11 days ago
It reminds me of: https://xkcd.com/1425/
In the same way, with today's AI models the task is now easily achievable.
aboardRat4
10 days ago
mathpix does quite an awesome job actually