dataflow
7 months ago
Before you get too excited, this is a probabilistic algorithm, not a deterministic one. It feels weird to call it an "extension" when you lose an absolute guarantee, but it's cool nonetheless.
nyrikki
7 months ago
> Finally, they introduce Invertible Bloom Filters, which add an exact get operation and a probabilistic listing operation.
I haven't spent time digging into the implementation details, but the exact get should allow for verification.
It is not uncommon to use probabilistic methods to reduce search space.
dataflow
7 months ago
I haven't dug into the gory details either, but later they say:
> To fully generalize this into a robust data structure, we need:
> (1) A partitioning scheme that creates recoverable partitions with high probability
> (2) An iterative process that uses recovered values to unlock additional partitions
And they also say:
> With proper sizing (typically m > 1.22d cells), IBFs recover the full symmetric difference with very high probability.
It really doesn't sound like this is both exact and linear-time like XOR... right? Perhaps the error is somehow one-sided, but then the time bound is probabilistic? If I'm missing something and they're truly maintaining both an absolute guarantee and an absolute time bound, that's mind-blowing! But I don't get that impression?
nullc
7 months ago
There is no absolute guarantee. You can use an arbitrarily large multiple of cells relative to the difference size and the decode can still fail when a set of entries exists that forms a cycle; it just becomes quite unlikely as the overhead goes up.
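To make the failure mode concrete, here's a toy Python sketch (my own contrivance, not from the article): two leftover elements that land in exactly the same cells form a 2-cycle that peeling can't break, no matter how much extra space the table has.

    # Every cell the two elements touch has count 2, so the peeling step
    # "find a cell with count == 1" never fires and decode gets stuck.
    cells = [{"count": 0, "key_sum": 0} for _ in range(8)]

    def insert(x, idxs):
        for i in idxs:
            cells[i]["count"] += 1
            cells[i]["key_sum"] ^= x

    insert(0xAAAA, [1, 4, 6])  # contrived: both elements map to the same 3 cells
    insert(0x5555, [1, 4, 6])

    pure = [i for i, c in enumerate(cells) if c["count"] == 1]
    print(pure)  # [] -> no pure cell to peel, regardless of how many cells are empty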
One of the ways of hybridizing iblt and exact algebraic techniques like the minisketch library I link in my other post is to staple a small algebraic sketch to the iblt. If the iblt is successful you're done; if it gets stuck, you take the recovered elements out of the algebraic sketch and decode that. Decoding the algebraic sketch is fast in spite of its O(n^2) behavior because it's small, and it will always succeed if there are few enough remaining elements (unlike the iblt).
Sadly this still doesn't give a guarantee, since you might have more elements stuck in a cycle than the backup sketch can hold, but small cycles are more likely than big ones, so there exists a range of sizes where this is more communication-efficient than a larger iblt.
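Roughly, the hybrid decode looks like this (my pseudocode with a hypothetical API, not minisketch's actual interface):

    def hybrid_decode(iblt, algebraic_sketch):
        recovered, complete = iblt.peel()     # hypothetical: peel as far as possible
        if complete:
            return recovered                  # common case: the iblt alone suffices
        for elem in recovered:
            algebraic_sketch.remove(elem)     # partial recovery shrinks the leftover
        # exact decode of whatever is still stuck, if it fits the sketch's capacity
        return recovered + algebraic_sketch.decode()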
hundredwatt
7 months ago
You don't lose absolute guarantees, but the probabilistic nature means the process may fail (in a guaranteed detectable way), in which case you can try again with a larger parameter.
The "bloom filter" name is misleading in this regard.
nullc
7 months ago
> (in a guaranteed detectable way)
To be pedantic, not guaranteed. The xor of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode. You can make the probability of this as low as you like by using a larger checksum, but the initial HN example was IIRC a list of 32-bit integers, so using a (say) 128-bit checksum to make false decodes 'cryptographically unlikely' would come at a rather high cost, since that size is added to every bucket.
If your set members are multi-kilobyte contact database entries or something, then the overhead required to make a false decode impossible would be insignificant.
This limitation also applies somewhat to the alternative algebraic approach in my comments (an overfull sketch could be falsely decoded), except that the added size needed to make a false decode cryptographically unlikely is very small and shrinks relative to the sketch as the sketch grows, instead of being linear in its size.
I haven't looked at your implementation, but it can be useful to have at least a 1-bit counter, or to just make the LSB of your checksum always 1. Doing so prevents falsely decoding an overfull bucket with an even number of members in it, and since the distribution of members to buckets is binomial, 2 is an extremely common count for overfull buckets. You can use a counter bigger than 1 bit (and combine it with addition in its ring rather than xor), but the tradeoff versus just having more checksum bits is less obvious.
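A quick illustration of the LSB trick (my own sketch): every per-element checksum is forced to be odd, so the xor of an even number of valid checksums is even and can never pass as a single valid element.

    import hashlib

    def checksum(key: int) -> int:
        digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=4).digest()
        return int.from_bytes(digest, "little") | 1   # force LSB = 1

    c = checksum(123) ^ checksum(456)   # an overfull bucket holding two elements
    print(c & 1)                        # 0 -> rejected, can't be a single valid element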
It's probably an interesting open question whether checksums exist such that the xor of 2..N valid codewords is unlikely to be a valid codeword... the "always emit 1" function has perfect performance for even values, but are there schemes that still contribute distance even in cases where N isn't completely precluded?
hundredwatt
7 months ago
> To be pedantic, not guaranteed. The xor of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode
The false decodes can be detected. During peeling, deleting a false decode inserts a new element with the opposite sign of count. Later, you decode this second false element and end up with the same element in both the A / B and B / A result sets (as long as decode completes without encountering a cycle).
So, after decode, check for any elements present in both A / B and B / A result sets and remove them.
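In code, the cleanup is just this (a sketch, assuming a decode that returns both difference sets):

    a_minus_b, b_minus_a = iblt.decode()            # hypothetical: returns both sets
    false_decodes = set(a_minus_b) & set(b_minus_a)
    a_minus_b = [x for x in a_minus_b if x not in false_decodes]
    b_minus_a = [x for x in b_minus_a if x not in false_decodes]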
--
Beyond that, you can also use the cell position for additional checksum bits in the decode process without increasing the data structure's bit size: i.e., if we attempt to decode an element x from a cell at position m, then one of the h_i(x) hash functions for computing indices should return m.
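Something like this (hypothetical helper; the h_i are the index hash functions):

    def position_ok(x, m, index_hashes, num_cells):
        # accept a candidate x peeled from cell m only if m is one of x's cells
        return m in {h(x) % num_cells for h in index_hashes}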
There's even a paper about a variant of IBFs that has no checksum field at all: https://arxiv.org/abs/2211.03683. It uses the cell position among other techniques.