Designing a Fast Concurrent Hash Table

37 points, posted 7 hours ago by burntsushi

8 comments

winwang

14 minutes ago

Are there "batch"-esque workloads where we want extreme throughput, but can tolerate a large latency (10–100 µs)?

_nalply

2 hours ago

Also have a look at concurrent_map. It is a concurrent BTreeMap, which means it maintains key order (not insertion order; the keys implement Ord).

https://lib.rs/crates/concurrent-map

I am using it for a transactional in-memory key-value store experiment.
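Not from the parent, but a minimal sketch of what using it looks like (method names as I understand them from the crate's docs; exact trait bounds and signatures may differ):

    use concurrent_map::ConcurrentMap;

    fn main() {
        // Keys are kept in Ord order, so range scans behave like BTreeMap's.
        let map: ConcurrentMap<u64, String> = ConcurrentMap::default();
        map.insert(2, "two".into());
        map.insert(1, "one".into());
        map.insert(3, "three".into());

        // get returns a clone of the value, not a reference.
        assert_eq!(map.get(&2).as_deref(), Some("two"));

        // Iterate a key range in sorted order, concurrently with other writers.
        for (k, v) in map.range(1..3) {
            println!("{k} => {v}");
        }
    }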

eterm

2 hours ago

I'd love to see this put through its paces with the following:

I have a strategy for benchmarking such data structures that I call the "birthday benchmark".

Unfortunately I'm not proficient enough in Rust to have put it into practice there, but the strategy is this:

Generate a stream of random bytes (pre-generate it, so that generation doesn't dominate the benchmark).

Consume the stream in blocks of n bytes.

Try to find a duplicated block.

Given the "birthday paradox / birthday attack", this is actually much quicker than it might first appear: with 6-byte blocks, a duplicate is expected after roughly 2^24 ≈ 17 million blocks. A simple single-threaded hash set can find a duplicate at a 6-byte block length in around 3 seconds on my hardware, and a specialised data structure takes this to less than a second.

A good concurrent hash table should improve on that even further, because this ought to be a very parallelizable problem, especially if you aren't constrained to finding the first such duplicate in the stream but allow yourself to find any duplicate, don't need to keep track of both sides of the pair, and are content with simply knowing you have found one.
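Not the commenter's code, but a minimal single-threaded Rust sketch of the idea (assuming the rand 0.8 crate; the buffer size is an arbitrary choice, large enough that a 48-bit duplicate is near-certain):

    use std::collections::HashSet;
    use rand::RngCore; // assumes rand 0.8

    const BLOCK_LEN: usize = 6; // 48-bit blocks, as in the comment above

    fn main() {
        // Pre-generate the stream so PRNG cost doesn't dominate the benchmark.
        // 256 MiB ≈ 45M blocks; a duplicate is expected after ~2^24 ≈ 17M.
        let mut data = vec![0u8; 256 * 1024 * 1024];
        rand::thread_rng().fill_bytes(&mut data);

        let mut seen: HashSet<[u8; BLOCK_LEN]> = HashSet::new();
        for (i, chunk) in data.chunks_exact(BLOCK_LEN).enumerate() {
            let block: [u8; BLOCK_LEN] = chunk.try_into().unwrap();
            // insert returns false if the block was already present.
            if !seen.insert(block) {
                println!("first duplicate at block {i}");
                return;
            }
        }
        println!("no duplicate found");
    }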

bloppe

an hour ago

This is a write-heavy workload. Papaya is optimized for read-heavy workloads.

It's an interesting benchmark, but the kind of people who would want to use Papaya probably wouldn't be very interested in it.

winwang

11 minutes ago

Do you have other examples of interesting workloads? Benchmarking is difficult, lol.

vlmutolo

2 hours ago

> Generate a stream of random bytes, pre-generation is advised to avoid dominating the benchmark.

Current PRNGs are pretty fast. The Xoroshiro RNG "shootout" benchmark [0] lists some self-reported speeds. They claim 8+ GB/s for even their slowest, highest-quality PRNG. The general-purpose one they recommend does 10 GB/s, and 32 GB/s when vectorized.

The vectorized versions get close to current DRAM speeds. I think I'd prefer that over reading from a giant table, given that the table reads will have significant implications for caching and disrupt that aspect of the benchmark.

[0]: https://prng.di.unimi.it/#shootout
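For a sense of why these generators are so fast: the xoroshiro128+ step from the linked page is only a handful of ALU operations per 64-bit output. A sketch of the core step (seeding omitted; the state must be initialised to something nonzero, e.g. via splitmix64):

    // xoroshiro128+ core step, per the algorithm on the linked page.
    struct Xoroshiro128Plus {
        s: [u64; 2], // must not be all zero
    }

    impl Xoroshiro128Plus {
        fn next_u64(&mut self) -> u64 {
            let s0 = self.s[0];
            let mut s1 = self.s[1];
            let result = s0.wrapping_add(s1); // the "+" output function

            s1 ^= s0;
            self.s[0] = s0.rotate_left(24) ^ s1 ^ (s1 << 16);
            self.s[1] = s1.rotate_left(37);
            result
        }
    }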

eterm

an hour ago

Thanks for that suggestion; I shall explore faster PRNGs and vectorization too, then.

My domain is C#, where it's perhaps unsurprising that it's a lot faster to have an array in memory than to go via the default Random, which I believe is Xoroshiro (at least in .NET 6+). It can certainly generate data quickly with Random.NextBytes(), but repeated calls to Random.NextInt64() are much slower.

Another issue I found with generating random numbers and trying to multi-thread is thread safety / synchronisation. If each thread does its own random generation, then it's difficult to compare results between using 1 thread and 2 threads, because it's no longer the same single stream of random data measured across each benchmark, so the meta-randomness of how early you "should" hit a dupe becomes a factor.

Having pre-generated numbers, with a single mechanism responsible for handing them out, makes it easier: you can either use a thread-safe counter, or have each thread step through the data with a stride equal to the thread count, plus a per-thread offset. The first "real" dupe is then still in the same place in the original pre-generated data.
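A hypothetical Rust sketch of that strided scheme (the Mutex<HashSet> is just a stand-in for a real concurrent table such as papaya, which is what you'd swap in to measure scaling):

    use std::collections::HashSet;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Mutex;
    use std::thread;

    const BLOCK_LEN: usize = 6;

    // Thread t handles blocks t, t + n_threads, t + 2*n_threads, ...,
    // so the set of candidate blocks is identical for any thread count.
    fn find_dupe(data: &[u8], n_threads: usize) -> bool {
        let seen = Mutex::new(HashSet::new()); // stand-in for a concurrent set
        let found = AtomicBool::new(false);
        let n_blocks = data.len() / BLOCK_LEN;
        thread::scope(|s| {
            for t in 0..n_threads {
                let (seen, found) = (&seen, &found);
                s.spawn(move || {
                    let mut i = t;
                    while i < n_blocks && !found.load(Ordering::Relaxed) {
                        let block: [u8; BLOCK_LEN] = data
                            [i * BLOCK_LEN..(i + 1) * BLOCK_LEN]
                            .try_into()
                            .unwrap();
                        // insert returns false on a duplicate.
                        if !seen.lock().unwrap().insert(block) {
                            found.store(true, Ordering::Relaxed);
                        }
                        i += n_threads;
                    }
                });
            }
        });
        found.load(Ordering::Relaxed)
    }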

You can then compare the incremental gains. For example, each thread having its own hash table would get you only modest gains (logarithmic, I think, is the right way to express it): essentially it gains just by having First Dupe = Min(First Dupe across N threads). Compare that with an actual thread-safe concurrent hash table, where you should[1] still find the first dupe, but sped up by a greater factor than naive work-sharing without state-sharing.

I recognise there are potential memory-caching issues at play with the pre-generation approach, but for larger bit counts the actual work should hopefully dominate, particularly since looked-up values aren't revisited by design.

[1] "Should", because there's a small chance in many algorithms that due to concurrency and race conditions you actually miss the duplicate. Either multiple runs should be tried looking for bi-modal data, or accept that the next dupe shouldn't be so long after the first, and be happy the speed-up is greater than the slow-down caused by very occassional race condition misses. The chance of a such a race condition is absolutely minuscule at the 48bit level, if we assume we are using say, 8 threads, and assume for a race condition to occur, the same 48-number would have to be concurrently being handled/generated by 2 different threads.