JackC
9 hours ago
Opinion from 10 years ago, I suspect still valid:
There are a million python libraries and tools to do some overlapping subset of the things you'd want to do with a pdf.
There are no doubt another million in other languages.
These are each basically bundles of some of the transformations you'd want to make to the same underlying data structure.
So, complex pdf scripts often need two or three different libraries to get their thing done, which is wasteful at both a dev effort and computational level.
The ecosystem would be greatly improved if someone made a great (probably rust based) in-memory low level pdf reading and writing data structure.
PDF libraries in any language could switch to using that structure and library internally, with the carrot that the switch would result in needing less code, and likely being some combination of faster and safer.
And then if they just exposed get_structure_pointer() and set_structure_pointer(), they could all interoperate for free. (Another carrot for joining -- small libraries could usefully add features and be adopted without needing to pick an existing popular library to glom onto.)
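To make the idea concrete, here is a minimal sketch of what such a shared low-level structure might look like. Everything here (the names Ref, Document, resolve, etc.) is hypothetical, not any real library's API; the point is just that PDF's object model reduces to a small graph of basic types that two libraries could hand back and forth without re-parsing:

```python
# Hypothetical sketch of a shared in-memory PDF object model.
# PDF objects are: null, booleans, numbers, strings, names, arrays,
# dictionaries, and streams, plus indirect references between them.
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """Indirect reference: object number and generation."""
    num: int
    gen: int = 0

@dataclass
class Stream:
    """A stream object: a dictionary plus raw (possibly compressed) bytes."""
    attrs: dict
    data: bytes

@dataclass
class Document:
    """The whole file is just a graph of objects plus a trailer."""
    objects: dict  # Ref -> object (None, bool, int, float, str, list, dict, Stream)
    trailer: dict

    def resolve(self, obj):
        """Follow indirect references until we hit a direct object."""
        while isinstance(obj, Ref):
            obj = self.objects[obj]
        return obj

# Two libraries agreeing on this structure could interoperate by
# passing a Document directly, instead of serializing and re-parsing:
doc = Document(
    objects={
        Ref(1): {"Type": "Catalog", "Pages": Ref(2)},
        Ref(2): {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
        Ref(3): {"Type": "Page", "Parent": Ref(2)},
    },
    trailer={"Root": Ref(1)},
)
root = doc.resolve(doc.trailer["Root"])
assert doc.resolve(root["Pages"])["Count"] == 1
```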
Not sure what would economically cause this to happen, but it would be great.
layer8
8 hours ago
When you write a PDF library, there are design trade-offs all the way down, depending on use cases. (Just “in-memory” is already an important design trade-off, because the PDF format is intentionally designed to not require the whole PDF to be loaded into memory at once.) It would also be antithetical to preferring deep modules with minimal interfaces over shallow modules with broad interfaces [0]. Lastly, in managed environments like the JVM, a C-interface library would come with additional complications and overheads.
[0] https://dev.to/gosukiwi/software-design-deep-modules-2on9
kmoser
4 hours ago
> The ecosystem would be greatly improved if someone made a great (probably rust based) in-memory low level pdf reading and writing data structure.
> Not sure what would economically cause this to happen, but it would be great.
Writing a library that is better than all the others is difficult to begin with. Continuing to upgrade and maintain it and fix bugs is even more difficult. Even with the right funding, you'd have to find someone who wants to keep at it year after year. When they inevitably lose interest, you'd have to find somebody else to take the reins--and weather the storm of complaints during the down time.
In short, thank you for volunteering to write and maintain this library for the rest of your life! :)
conradev
7 hours ago
The ecosystem would be greatly improved if someone made a great (probably rust based) in-memory low level pdf reading and writing data structure.
https://github.com/J-F-Liu/lopdf
whizzter
7 hours ago
I'm actually debugging a PDF parsing issue as we speak, and I started writing a parser (partially to understand the issue, partially as a last resort, since the code in the parser I was debugging felt a bit shoddy).
The PDF format is frankly quite horrible, extended over the years by kludges that feel more or less like premature optimizations in some cases and bloated overkill in others.
While theoretically a nice idea, the issue is that there are just so many damn object types with specialized properties inside a PDF that you'd basically end up with all the complications of an FFI for each binding you'd do to expose a sane subset.
Theoretically one could perhaps make a canonical PDF<->JSON or similar mapping from an established library that most PDF data consumers/generators could use if memory usage isn't too constrained (because the underlying object model isn't entirely dissimilar).
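A sketch of what such a mapping might look like, under assumed conventions: JSON has no native type for raw bytes or for indirect references, so this illustrative version (not cpdf's actual schema, nor any standard) tags them with wrapper objects, while the rest of the PDF object model maps directly onto JSON types:

```python
import base64
import json

def to_json(obj):
    """Map a parsed PDF object onto JSON types (illustrative convention).
    Indirect references are modeled here as (num, gen) tuples; raw bytes
    and references get tagged wrappers since JSON can't express them."""
    if isinstance(obj, dict):
        return {k: to_json(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_json(v) for v in obj]
    if isinstance(obj, bytes):
        return {"$bytes": base64.b64encode(obj).decode("ascii")}
    if isinstance(obj, tuple):  # (object number, generation) reference
        return {"$ref": [obj[0], obj[1]]}
    return obj  # None, bool, int, float, str pass through unchanged

page = {"Type": "Page", "Parent": (2, 0), "Contents": b"BT /F1 12 Tf ET"}
print(json.dumps(to_json(page)))
```

The lossy parts (streams, binary strings, number precision) are exactly where such a canonical mapping would need careful design.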
whenc
6 hours ago
You can do:

cpdf -output-json in.pdf -o out.json

(Modify out.json as liked.)

cpdf -j out.json -o out.pdf

(Disclaimer, I wrote it.)
zehaeva
7 hours ago
I don't think this _really_ contributes to the conversation, but I think we can sum this entire post up with just one XKCD comic.
specialist
6 hours ago
> someone made a great ... in-memory low level pdf reading and writing data structure
Are you suggesting Adobe's Core Object Application Programming Interface (COAPI) for PDF isn't sufficient?
Kidding!
I worked on print production software in the '90s. Stuff like image positioning (e.g. bookwork), trapping, color separations, etc. Adobe's SDKs, for both PostScript and PDF, were most turrible. For our greenfield product for packaging (printing boxes), I wrote a minimalist PDF library, supporting just the feature set we needed. So simple.
Of course, PDF is now an ever growing katamari style All The Things amalgamation of, oops, sorry I ran out of adjectives.
Back to your point: after URLs and HTTP, the DOM is the 3rd best thing spawned by "the web".
The DOM concept itself. Isomorphism between in-memory and serialized. That its all just an object graph. Composition over inheritance.
Not the actual DOM API; gods no.
I understand that API design is wicked hard. But how is it that of the Java tools, only JDOM2 (the sequel) managed to get the class hierarchy correct? So that incorrect usage is not permitted?
(I haven't looked at popular libraries for other languages. I assume they all also fell into the trap of transliterating JavaScript's DOM's API. Like dom4j and successors did.)
I'm just repeating your point (I think) that Adobe should have staked a strong starting conceptual position on PDF internals, what a PDF is. Something more WinForms and less Win32.
30+ (?!) years later, I'm still flabbergasted by PDF's success, despite Adobe's stewardship.
PS- And another thing...
For a print description language, I greatly preferred HP's PCL-5. Emotionally, it just feels more honest somehow. Initially, Adobe couldn't decide if PDF was for print control or documents. Customers wanted documents, so Adobe grudgingly complied, haphazardly.
At least "the web" had/has committees.
mannyv
5 hours ago
"Adobe couldn't decide if PDF was for print control or documents"
Apparently people don't understand the history of PDF. PDF was originally a way to encapsulate PostScript so you could display it on a screen. Unlike PCL, Postscript (and PDF) were device-independent, with a WYSIWYG guarantee. Postscript and PDF are literally the history of WYSIWYG on personal computers and computer-based printing/typesetting.
PDF is not "print control" in the sense of a job control language. PDF has always been about documents, and the features of PDF files can be seen as an attempt by Adobe to both drive and follow the market's evolution of document handling.
PDF is complicated because it's used widely for lots of different things, including printing. And if you've never worked in the printing industry you have no idea how much of a PITA it is.
PDF succeeded for a lot of reasons, but probably the easiest explanation is that they were easier to create - you just printed it and the PDF printer driver spat out a PDF file that you could share everywhere.
sleepybrett
3 hours ago
One of my first jobs was at an isp/web/cohost company. We had a big bank of modems for dialup customers, had some customers who terminated isdn with us, a rack of colocation and built websites as well.
The company was partially owned by, and housed primarily in, a print shop; we worked above the press floor, and I was sometimes pressed into service helping when we were slow. (I had some experience working in a print shop in high school, helping with PageMaker and helping to run the big Heidelberg, and similarly in college.)
Nothing like ending your day writing perl cgi scripts and troubleshooting customers damn winsock configurations and then going home and coughing up whatever color was running on the presses that day.