dict and the relevant dictionaries are things i pretty much always install on every new laptop. gcide in particular includes most of the famous 1913 webster dictionary with its sparkling prose:
: ~; dict glisten
2 definitions found
From The Collaborative International Dictionary of English v.0.48 [gcide]:
Glisten \Glis"ten\ (gl[i^]s"'n), v. i. [imp. & p. p.
{Glistened}; p. pr. & vb. n. {Glistening}.] [OE. glistnian,
akin to glisnen, glisien, AS. glisian, glisnian, akin to E.
glitter. See {Glitter}, v. i., and cf. {Glister}, v. i.]
To sparkle or shine; especially, to shine with a mild,
subdued, and fitful luster; to emit a soft, scintillating
light; to gleam; as, the glistening stars.
Syn: See {Flash}.
[1913 Webster]
it's interesting to think about how you would implement this service efficiently under the constraints of mid-01990s computers, when a gigabyte was still a lot of disk space and multiuser unix servers commonly had about 100 mips (https://netlib.org/performance/html/dhrystone.data.col0.html)
totally by coincidence i was looking at the dictzip man page this morning; dictzip produces gzip-compatible files that support random seeks, so you can keep the database for your dictd server compressed. (as far as i know, rik faith's dictd is still the only server implementation of the dict protocol, which is incidentally not a very good protocol.) you can see that the penalty for seekability is about 6% in this case:
: ~; ls -l /usr/share/dictd/jargon.dict.dz
-rw-r--r-- 1 root root 587377 Jan 1 2021 /usr/share/dictd/jargon.dict.dz
: ~; \time gzip -dc /usr/share/dictd/jargon.dict.dz|wc -c
0.01user 0.00system 0:00.01elapsed 100%CPU (0avgtext+0avgdata 1624maxresident)k
0inputs+0outputs (0major+160minor)pagefaults 0swaps
1418350
: ~; gzip -dc /usr/share/dictd/jargon.dict.dz|gzip -9c|wc -c
556102
: ~; units -t 587377/556102 %
105.62397
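the way dictzip pulls this off, as i understand the format, is by compressing the data in chunks separated by full flushes and stashing the table of compressed chunk sizes in a gzip extra field that old gunzips ignore; a full flush resets the compressor to a byte boundary with no history, so you can inflate any chunk by itself. here's a sketch of a reader in python, untested against unusual headers (it assumes no FHCRC or FCOMMENT fields, which dictzip doesn't write):

import struct, zlib

def read_dictzip_chunk(path, n):
    # a sketch: parse the 'RA' subfield dictzip hides in the gzip
    # extra field, then seek to chunk n and inflate just that chunk
    with open(path, 'rb') as f:
        magic, cm, flg, mtime, xfl, osb = struct.unpack('<HBBIBB', f.read(10))
        assert magic == 0x8b1f and cm == 8 and flg & 4  # gzip, FEXTRA set
        (xlen,) = struct.unpack('<H', f.read(2))
        extra, sizes, i = f.read(xlen), None, 0
        while i + 4 <= len(extra):       # walk the extra-field subfields
            si = extra[i:i+2]
            (sublen,) = struct.unpack('<H', extra[i+2:i+4])
            if si == b'RA':              # dictzip's random-access table
                ver, chlen, chcnt = struct.unpack('<HHH', extra[i+4:i+10])
                sizes = struct.unpack('<%dH' % chcnt,
                                      extra[i+10:i+10+2*chcnt])
            i += 4 + sublen
        assert sizes is not None, 'not a dictzip file'
        if flg & 8:                      # skip FNAME if present
            while f.read(1) not in (b'\0', b''):
                pass
        f.seek(sum(sizes[:n]), 1)        # compressed chunks start here
        raw = f.read(sizes[n])
    # each chunk ends at a full flush, so a fresh raw inflater works
    return zlib.decompressobj(-zlib.MAX_WBITS).decompress(raw)

chlen here is the uncompressed chunk size, so the chunk holding byte offset p is just p // chlen.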
nowadays computers are fast enough that it probably isn't a big win to gzip in such small chunks (dictzip has a chunk limit of 64k) and you might as well use a zipfile, all implementations of which support random access:
: ~; mkdir jargsplit
: ~; cd jargsplit
: jargsplit; gzip -dc /usr/share/dictd/jargon.dict.dz|split -b256K
: jargsplit; zip jargon.zip xaa xab xac xad xae xaf
adding: xaa (deflated 60%)
adding: xab (deflated 59%)
adding: xac (deflated 59%)
adding: xad (deflated 61%)
adding: xae (deflated 62%)
adding: xaf (deflated 58%)
: jargsplit; ls -l jargon.zip
-rw-r--r-- 1 user user 565968 Sep 22 09:47 jargon.zip
: jargsplit; time unzip -o jargon.zip xad
Archive: jargon.zip
inflating: xad
real 0m0.011s
user 0m0.000s
sys 0m0.011s
so you see 256-kibibyte chunks have submillisecond decompression time (more like 2 milliseconds on my cellphone) and only about a 1.8% size penalty for seekability:
: jargsplit; units -t 565968/556102 %
101.77413
and, unlike the dictzip format (which lists the chunks in a backward-compatible extra field in the gzip header), zip also supports efficient appending.
even in python (3.11.2), decompressing a chunk takes only about a millisecond:
In [13]: z = zipfile.ZipFile('jargon.zip')
In [14]: [f.filename for f in z.infolist()]
Out[14]: ['xaa', 'xab', 'xac', 'xad', 'xae', 'xaf']
In [15]: %timeit z.open('xab').read()
1.13 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
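and appending really is cheap: python's zipfile in 'a' mode writes the new member over the old central directory and then a fresh central directory after it, leaving the existing compressed chunks untouched. a sketch, where xag is made up, just continuing split's naming sequence:

import zipfile

# add another 256-kibibyte chunk without rewriting the existing members
with zipfile.ZipFile('jargon.zip', 'a', zipfile.ZIP_DEFLATED) as z:
    z.write('xag')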
this kind of performance means that any algorithm that would be efficient reading data stored on a conventional spinning-rust disk will be efficient reading compressed data if you put the data into a zipfile in "files" of around a meg each. (writing is another matter; zstd may help here, with its order-of-magnitude faster compression, but info-zip zip and unzip don't support zstd yet.)
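concretely, here's a sketch of a pread()-style interface over such a zipfile, assuming the members are equal-sized chunks stored in order, as with split above:

import zipfile

class ChunkedZip:
    # a sketch: random-access reads over uncompressed data stored as
    # equal-sized chunk members in a zipfile, as built with split above
    def __init__(self, path, chunk_size=256*1024):
        self.z = zipfile.ZipFile(path)
        self.names = [f.filename for f in self.z.infolist()]  # in order
        self.chunk_size = chunk_size

    def pread(self, offset, size):
        # read size bytes at offset, inflating only the chunks touched
        out = b''
        while size > 0:
            n, skip = divmod(offset, self.chunk_size)
            if n >= len(self.names):
                break                    # past end of data
            chunk = self.z.open(self.names[n]).read()
            piece = chunk[skip:skip + size]
            if not piece:
                break
            out += piece
            offset += len(piece)
            size -= len(piece)
        return out

the loop is what handles reads that straddle a chunk boundary.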
dictd keeps an index file in tsv format which uses what looks like base64 to encode each entry's byte offset and length in the uncompressed database:
: jargsplit; < /usr/share/dictd/jargon.index shuf -n 4 | LANG=C sort | cat -vte
fossil^IB9xE^IL8$
frednet^IB+q5^IDD$
upload^IE/t5^IJ1$
warez d00dz^IFLif^In0$
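the numbers are just base64 digits used positionally, most significant digit first; a sketch of a decoder, checked against the 'fossil' line above:

B64 = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
       'abcdefghijklmnopqrstuvwxyz0123456789+/')

def b64_number(s):
    # dictd's index encodes offsets and lengths as base64 digits,
    # most significant first: 'B9xE' -> 515140, 'L8' -> 764
    n = 0
    for c in s:
        n = n * 64 + B64.index(c)
    return n

so the 'fossil' entry is the 764 bytes starting at byte 515140 of the uncompressed data, and pread(515140, 764) on the jargon.zip sketch above would fetch it, since that zipfile holds the same uncompressed bytes.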
this is very similar to the index format used by eric raymond's volks-hypertext (https://www.ibiblio.org/pub/Linux/apps/doctools/vh-1.8.tar.g...) or vi ctags or emacs etags, but it supports random access into the file.
strfile from the fortune package works on a similar principle but uses a binary data file and no keys, just offsets:
: ~; wget -nv canonical.org/~kragen/quotes.txt
2024-09-22 10:44:50 URL:http://canonical.org/~kragen/quotes.txt [49884/49884] -> "quotes.txt" [1]
: ~; strfile quotes.txt
"quotes.txt.dat" created
There were 87 strings
Longest string: 1625 bytes
Shortest string: 92 bytes
: ~; fortune quotes.txt
Get enough beyond FUM [Fuck You Money], and it's merely Nice To Have
Money.
-- Dave Long, <dl@silcom.com>, on FoRK, around 2000-08-16, in
Message-ID <200008162000.NAA10898@maltesecat>
: ~; od -i --endian=big quotes.txt.dat
0000000 2 87 1625 92
0000020 0 620756992 0 933
0000040 1460 2307 2546 3793
0000060 3887 4149 5160 5471
0000100 5661 6185 6616 7000
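reading string n back is just parsing that header (six big-endian words: version, string count, longest, shortest, flags, and the delimiter character padded out to a word, 0x25000000 = '%' above) plus the offset table; a sketch assuming the usual layout, with strings terminated by a line containing just the delimiter:

import random, struct

def read_fortune(textpath, n=None):
    # a sketch: parse the strfile .dat header and offset table, seek
    # to string n in the text file, and read up to the delimiter line
    # (flags like ROTATED and ORDERED are ignored here)
    with open(textpath + '.dat', 'rb') as f:
        version, numstr, longlen, shortlen, flags, delim = \
            struct.unpack('>6I', f.read(24))
        offsets = struct.unpack('>%dI' % numstr, f.read(4 * numstr))
    if n is None:
        n = random.randrange(numstr)     # what fortune itself does
    with open(textpath) as f:
        f.seek(offsets[n])
        lines = []
        for line in f:
            if line.rstrip('\n') == chr(delim >> 24):  # '%' above
                break
            lines.append(line)
    return ''.join(lines)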
of course if you were using a zipfile you could keep the index in the zipfile itself, and then there's no point in using base64 for the file offsets, or limiting them to 32 bits
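for instance, a sketch with a made-up layout: a tsv member named 'index' in the same zipfile, with plain decimal offsets and lengths, read through the ChunkedZip sketch above:

def lookup(path, word):
    # a sketch: 'index' maps headword -> decimal offset and length
    # into the uncompressed data; both the index and the data chunks
    # live in the same zipfile
    cz = ChunkedZip(path)
    cz.names = [n for n in cz.names if n != 'index']  # chunks only
    for line in cz.z.open('index').read().decode().splitlines():
        w, offset, length = line.split('\t')
        if w == word:
            return cz.pread(int(offset), int(length))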