acabal
4 months ago
I'm shocked and saddened to hear this. Greg was a deep source of knowledge and support as I started and shepherded Standard Ebooks. He was generous with his time and experience, and unbelievably patient with me, some guy he had never heard of or met before who was just another cold-email in what must have been an endless stream in his inbox. We should all aspire to his high spirit of camaraderie, charity, and kindness. The world has lost a champion of both literature and the free web.
NoMoreNicksLeft
4 months ago
Why are there no unique numbers assigned to Standard Ebooks' ebooks? I understand that there is a cost associated with ISBNs, but it's very irritating to not have something that identifies them uniquely. Most (all?) aren't even in Worldcat, so I can't use OCLC numbers for that purpose either.
everybodyknows
3 months ago
> no unique numbers
This suggests a misunderstanding of the Standard Ebooks process, which allows continual incremental corrections to the authoritative source of individual books (in XHTML, on GitHub). So, a truly unique identifier would only be valid to the production output(s) from a particular state of the Git-repo sources.
https://standardebooks.org/contribute/report-errors
Recall also that final user content is made available in multiple formats, currently at least six. Example:
https://standardebooks.org/ebooks/geronimo/geronimos-story-o...
Asynchronous to the correction process, Standard Ebooks updates its own production tools. So if an individual book's content requires correction, should the "respin" be done with TOT tools, or with the versions available at time of first publication? Disclaimer: I don't actually know which is current practice -- but using the TOT tool suite is obviously vastly easier.
For most practical purposes, I'd suggest the git-commit date, along with short substrings of author name and title, would suffice.
NoMoreNicksLeft
3 months ago
>This suggests a misunderstanding of the Standard Ebooks process, which allows continual incremental corrections to the authoritative source of individual books (in XHTML, on GitHub). So, a truly unique identifier would only be valid to the production output(s) from a particular state of the Git-repo sources.
Well, one of us has a misunderstanding. Just because the printer strikes off the printing number from the colophon for each subsequent printing doesn't mean they issue a new ISBN. That stays the same. If they wanted to include a version number too, I wouldn't mind that, but it's not nearly as necessary as this. I use the year as a rough version number in the file names as well.
>Recall also that final user content is made available in multiple formats, currently at least six. Example:
I don't need them to issue a number per file format, but if they want to... that doesn't bother me. That's sort of self-evident which of the formats it is, after all.
>I'd suggest the git-commit date, along with short substrings of author name and title, would suffice.
It doesn't. A number of authors have at one time or another released books with similar or identical titles that are not the same book. This is the trouble... someone who uses or would use the books is asking for something that is missing but easy to supply, and instead of a "well gee, we never considered that, let us think about it" I have a dozen assholes crawling out of the woodwork to say "no, you're doing it wrong".
I need unique identifiers that are human readable. I just do. The world discovered this need for books before you were born. They invented a global standard, even. There is an entire field of science out there about this, that you seem to be ignorant of even existing. I've been doing this for years, and I keep bumping up against it. But you think it can be solved because you used git and know about hashes or whatever, and it's just like what you deal with in your software development job!
testdelacc1
4 months ago
> very irritating
I think it’s possible to express this in a less caustic way. After all, Standard Ebooks is high quality and free of charge, right?
contact9879
3 months ago
the ebook identifier uniquely identifies every ebook. standard ebook ebooks use the url as their unique identifier
NoMoreNicksLeft
3 months ago
Those are poor identifiers. A numeric or short alphanumeric identifier that can be part of the filename is important... I have as many as 5 different editions of the same title so title+author doesn't do the trick. Nor am I putting a url into the filename, couldn't if I wanted to as a url contains characters that are disallowed in every filesystem I've ever heard of. How difficult is it to keep an incrementing catalog number like Project Gutenberg does? Anything that doesn't have a proper unique identifier just seems unprofessional.
opminion
3 months ago
NoMoreNicksLeft
3 months ago
This isn't a solution either. Not sure why you think it is. Here's how I name files, just as an example:
Meditationes de Prima Philosophia - GTNB•0023306 (2007) - Descartes, René (aut)
Meditations on First Philosophy - 9780203417621 (2013) - Descartes, René (aut); Haldane, Elizabeth (trl); Ross, G. R. T. (trl) & Tweyman, Stanley (edt,wfw)
Where and how should I put a URI in there, especially considering that they at minimum need the colon (:), which is a problematic character in filenames on NTFS/HFS/APFS/XFS? They're not exactly disallowed, but they create a resource fork or some shit and so it doesn't behave as you would expect. If Standard Ebooks just started numbering their books, then I'd slap the STBK• in front of the number and use that. They're not in Worldcat, or I could use OCLC numbers (but it shouldn't be other people's job to keep the catalog of their own books).
contact9879
3 months ago
choose your favorite hash
hash(<dc:identifier>)
NoMoreNicksLeft
3 months ago
Hashes are too long, aren't human-recognizable as to meaning, etc. I don't want half-assed workarounds. They need to uniquely number their books.
contact9879
3 months ago
- they don’t need to do anything to conform to your arbitrary organization choices
- hashes are as long or short as you need them to be
- publication timestamp is in every ebook’s metadata, is almost guaranteed to be unique, monotonically increases, and has actual semantic meaning compared to an isbn or oclc
NoMoreNicksLeft
3 months ago
>they don’t need to do anything to conform to your arbitrary organization choices
They don't need to. It'd be smart. It's not "arbitrary". It's fucking library science.
>hashes are as long or short as you need them to be
Hashes might uniquely identify a computer file, but they don't uniquely identify an edition/release of a published book. Some jackass on libgen decides to tweak a single byte, now it has a new hash... but it's not a new edition.
>publication timestamp is in every ebook’s metadata
As someone who takes a look at every internal opf file, no... they're not in every ebook.
You're suggesting I go to the extra trouble of doing a job they could do easily, when I can only do it poorly, and I don't know why... because the first person to respond was a dumbass and thought I was attacking him? I swear, 99% of humans are still monkeys.
ndriscoll
3 months ago
You don't need to hash file contents (though that is often a useful thing to do). You can hash e.g. the URL that was earlier claimed to be the canonical identifier. Running it through your favorite hash function fixes your complaints about file names (choose your favorite hash function such that it is not too long and only outputs allowed characters).
NoMoreNicksLeft
3 months ago
Ah. The url, so I can substitute one difficult-for-human-readability with another difficult-for-human-readability, both of which are excessively long and opaque-by-design.
>choose your favorite hash function such that it is not too long
ISBN's 13 digits is about as long as is tolerable. Any time there is a list of authors six names long (academic titles) along with a subtitle, it's very easy to bump up against max filename size.
This isn't a problem I can solve on my own. Just trying to bring attention to it. My solution thus far is to just avoid publishers who are so unprofessional as to not provide numbers. It's not tough, Project Gutenberg does it. Anyone can do it. If you're some amateur whose entire catalog is 8 books published, you say "this book is 1, and this book is 2" etc, and it's a done deal. Again, I don't expect anyone to use ISBNs (in the US, you have to pay for them unless you're one of the big 5 publishing houses), but just use your own for god's sake.
ndriscoll
3 months ago
Hashes are not excessively long unless you choose to make them so. They might be opaque/random if you want, or they might not. "Remove all special characters and keep only the first 5 characters with space padding" is a string hash function. "Keep only the first 5 vowels with space padding" is a string hash function.
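To make those two toy examples concrete, here's a rough Python sketch. The function names are made up for illustration, and "special characters" is assumed to mean anything that isn't a letter or digit:

```python
def first5_alnum(text: str) -> str:
    """Drop everything that is not a letter or digit, keep the first 5
    remaining characters, and pad with spaces to a fixed width of 5."""
    kept = "".join(ch for ch in text if ch.isalnum())
    return kept[:5].ljust(5)


def first5_vowels(text: str) -> str:
    """Keep only the first 5 vowels (a, e, i, o, u), padded with spaces
    to a fixed width of 5."""
    vowels = "".join(ch for ch in text if ch.lower() in "aeiou")
    return vowels[:5].ljust(5)


if __name__ == "__main__":
    print(repr(first5_alnum("the-indiscreet")))
    print(repr(first5_vowels("geronimo")))
```

Both are readable and short, but they collide very easily (any two titles sharing the same first five letters get the same value), so on their own they wouldn't work as unique identifiers.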
Here's a friendly AI-generated hash function to give you an opaque 13-digit number if you're into that:
echo -n "$URL" | sha1sum | awk '{print $1}' | xxd -r -p | od -An -t u8 | tr -d ' \n' | cut -c1-13
For example, for https://standardebooks.org/ebooks/denis-diderot/the-indiscre... you get the ID 4897562473051.
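If you'd rather not depend on a shell pipeline, the same idea could be sketched in Python along these lines. Note this is not a port of the one-liner above and won't produce the same numbers: it's an assumed variant that reads the first 8 bytes of the SHA-1 digest as a big-endian integer and always zero-pads to 13 digits. The function name and the example URL are placeholders.

```python
import hashlib


def url_to_id(url: str, digits: int = 13) -> str:
    """Derive a fixed-width decimal ID from a URL.

    Takes the SHA-1 of the UTF-8 encoded URL, interprets the first
    8 bytes as a big-endian unsigned integer, and reduces it modulo
    10**digits so the result always has exactly `digits` digits.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    number = int.from_bytes(digest[:8], "big")
    return str(number % 10**digits).zfill(digits)


if __name__ == "__main__":
    # Placeholder URL, not a real catalog entry.
    print(url_to_id("https://example.org/ebooks/some-author/some-title"))
```

The result is stable for a given URL, but still opaque, and at 13 digits collisions are unlikely rather than impossible.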
It looks like their ebook sources are all published in git repos online, so you could check out the repos, get the timestamp of the initial commits, and do a monotonic ID on that if you wanted. You could also contribute the change back to them if you think it's something others would benefit from.
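A minimal sketch of that numbering step, assuming you've already collected each repo's first-commit Unix timestamp (for instance with something like `git log --reverse --format=%ct | head -n 1` in each checkout). The function name, the `STBK•` prefix (borrowed from the naming scheme upthread), and the 7-digit width are all just illustrative choices:

```python
def assign_ids(first_commit_times: dict, prefix: str = "STBK•", width: int = 7) -> dict:
    """Number repos 1, 2, 3, ... in order of their first commit.

    `first_commit_times` maps a repo name to the Unix timestamp of its
    initial commit. Ties are broken by repo name so the result is
    deterministic.
    """
    ordered = sorted(first_commit_times.items(), key=lambda item: (item[1], item[0]))
    return {
        repo: f"{prefix}{number:0{width}d}"
        for number, (repo, _timestamp) in enumerate(ordered, start=1)
    }


if __name__ == "__main__":
    # Made-up repo names and timestamps, for illustration only.
    sample = {"author-b_title-b": 200, "author-a_title-a": 100, "author-c_title-c": 300}
    for repo, ebook_id in assign_ids(sample).items():
        print(ebook_id, repo)
```

One caveat: recomputing from scratch renumbers everything if a repo with an older first commit shows up later, so in practice you'd want to store the numbers once they're assigned and only append new ones.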
9dev
3 months ago
Have a little respect, for fucks sake. This does not belong here.