300μs typo detection for 1.3M words

94 points, posted 6 days ago
by todsacerdoti

13 Comments

anonymoushn

6 days ago

To do this fast path in about 1μs, you could do this stuff:

- Remove the tries. If the detection must work exactly as it does now, you can store a small array of the prefixes and suffixes and just compare branchlessly against all of them. A comparison looks like a vector load, then movemask(cmpeq(prefix, needle) | cmpeq(prefix, all_zeros)) == scalar_all_ones in the case where your needle fits into a vector register (almost all words); see the sketch after this list. Edit: you can aggregate the results of consecutive individual compares using cmov with the condition being this_mask == scalar_all_ones, but a few branches here would also be fine. This requires that the prefixes be sorted in descending order by length, so that the first match you find is the longest.

- Don't heap allocate strings several times during the query.

- Use a cheaper hash than siphash. A reasonable hash for short strings is 2 rounds of AES. ahash is an off-the-shelf hash which will do something like this.

- Use a hash table or hasher interface that lets you separate the hashing from the hash table lookups.

- Use a hash table that lets you prefetch the slot for a given hash. Compute your 3 hashes, prefetch the appropriate slots, then query the strings. If you get the chance to process more than 1 needle at a time, you can hide more latency here: the optimal stride for prefetching ahead of your queries into a large hash table is significantly more than 3. (A sketch of the hashing points follows the list.)

- stuff about compound words omitted for brevity, but the approach is similar to the above.
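
Roughly, a single prefix comparison could look like this (untested sketch, x86-64/SSE2 only; assumes the prefixes are stored zero-padded to 16 bytes and sorted by descending length, and the function name is made up):

```rust
#[cfg(target_arch = "x86_64")]
fn longest_prefix_match(prefixes: &[[u8; 16]], needle: &[u8; 16]) -> Option<usize> {
    use std::arch::x86_64::*;
    // SSE2 is part of the x86-64 baseline, so these intrinsics are always available here.
    unsafe {
        let needle_vec = _mm_loadu_si128(needle.as_ptr() as *const __m128i);
        let zero = _mm_setzero_si128();
        for (i, p) in prefixes.iter().enumerate() {
            let prefix = _mm_loadu_si128(p.as_ptr() as *const __m128i);
            // A byte lane "matches" if the prefix byte equals the needle byte,
            // or the prefix byte is 0, i.e. past the end of this zero-padded prefix.
            let eq = _mm_cmpeq_epi8(prefix, needle_vec);
            let pad = _mm_cmpeq_epi8(prefix, zero);
            if _mm_movemask_epi8(_mm_or_si128(eq, pad)) == 0xFFFF {
                // Prefixes are sorted by descending length, so the first hit is the longest.
                return Some(i);
            }
        }
        None
    }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    // "anti" (longer) sorted before "an" (shorter); both zero-padded to 16 bytes.
    let mut prefixes = [[0u8; 16]; 2];
    prefixes[0][..4].copy_from_slice(b"anti");
    prefixes[1][..2].copy_from_slice(b"an");

    let mut needle = [0u8; 16];
    needle[..9].copy_from_slice(b"antidotes");

    assert_eq!(longest_prefix_match(&prefixes, &needle), Some(0)); // matches "anti"
}
```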
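
And roughly what the cheaper-hash / separate-hashing points could look like (untested sketch; assumes the ahash crate, with made-up map and key names — the prefetch itself additionally needs a table that exposes the slot for a precomputed hash, which std's HashMap doesn't give you on stable):

```rust
use std::collections::HashMap;
use std::hash::BuildHasher;

fn main() {
    // ahash::RandomState is a drop-in BuildHasher that is much cheaper than
    // the default SipHash for short string keys.
    let mut dictionary: HashMap<String, u32, ahash::RandomState> = HashMap::default();
    dictionary.insert("hello".to_string(), 1);

    // Querying with &str (instead of building a String) avoids a heap
    // allocation per lookup.
    let found = dictionary.get("hello").copied();

    // Hashing separated from the lookup: compute the hash up front. With a
    // table that exposes the slot for a precomputed hash, this is where the
    // prefetch would be issued before probing.
    let hash = dictionary.hasher().hash_one("hello");

    println!("found={found:?} hash={hash:#x}");
}
```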

The document splitting for 38M Hacker News posts also seems like it should complete in a minute or something. Not sure what's going on there.

underyx

6 days ago

Title:

> 300μs typo correction for 1.3M words

Article:

> 300μs for correctly spelled queries and ~5ms/word for misspellings

Aren't these contradictory?

skeptrune

6 days ago

The important number that changed here was the 30ms to 300μs improvement for correctly spelled queries. 30ms for typos wasn't something we considered a problem, and improving that to ~5ms was really a side benefit of fixing things for the correct spellings.

Additional latency for actual typos seems OK, but degradation for correctly spelled queries felt really bad. What was really damaging to the UX for us was latency spiking on search queries where it didn't need to.

underyx

6 days ago

Sure, but why does the title claim 300μs typo correction?

dang

6 days ago

That's a good point, so we did a s/correct/detect/ on the title above. Thanks!

orlp

6 days ago

The title says 300μs typo detection.

How? If the query doesn't finish in 300μs, return error :)

skeptrune

6 days ago

Agreed. Confirming correct spellings is the base case of the system and what the title is referring to, but that's not clear.

Sesse__

5 days ago

Also, 300μs _per word_, looking up a word in a corpus of 1.3M words. Not 300μs for correcting 1.3M words!

claytonwramsey

6 days ago

It's a little odd that they're using `lazy_static` now that OnceCell and LazyLock are in the standard library.
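
Roughly, the std-only replacement looks like this (the static's contents here are just an illustration, not the article's actual code):

```rust
use std::collections::HashSet;
use std::sync::LazyLock;

// With std's LazyLock (stable since 1.80), no lazy_static! macro is needed:
// the set is built once, on first access, by the closure below.
static DICTIONARY: LazyLock<HashSet<&'static str>> =
    LazyLock::new(|| ["hello", "world", "typo"].into_iter().collect());

fn main() {
    assert!(DICTIONARY.contains("typo"));
}
```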

chucksmash

6 days ago

It's not that weird considering LazyLock has only been in stable Rust since 1.80.0, which is ~6 weeks old!

It's a great tip though. As an out-of-the-loop, casual Rust programmer who has internalized using `lazy_static!` for this purpose, I probably never would have noticed this update short of lazy_static throwing compiler warnings directing me to the std lib equivalents (so... maybe never?).

gamegoblin

6 days ago

Tools like GPT and Copilot still generate lazy_static a lot, since that's what the majority of their training data uses.

skeptrune

6 days ago

Yeah... We might still be kinda Rust noobs lol. Should clean that up.