hackernews client

quuxplusone

3 months ago

TFA correctly points to (subnet-structure-preserving) encryption as the right way to anonymize IP addresses, although for some reason it calls it "IPCrypt" instead of "Crypto-PAn."

https://en.wikipedia.org/wiki/Crypto-PAn

comex

3 months ago

Anonymization is supposed to be irreversible. This scheme is reversible by whoever has the key. I don't really get the point of it.

true_religion

3 months ago

Any stable hash can't truly anonymize IP addresses, because there is a finite number of possible outputs (only 2^32 for IPv4), all easily computable on ordinary machines.
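
To make the enumeration point concrete, here is a minimal sketch assuming the addresses were hashed with plain, unkeyed SHA-256. The function names and the example network are hypothetical, and the search is limited to a /24 so it finishes instantly; the same loop over the full IPv4 space is well within reach of ordinary hardware.

```python
import hashlib
import ipaddress
from typing import Optional

def hash_ip(ip: str) -> str:
    # Assumed scheme: an unsalted, unkeyed SHA-256 of the dotted-quad string.
    return hashlib.sha256(ip.encode()).hexdigest()

def reverse_hash(target: str, network: str) -> Optional[str]:
    # Hash every address in the candidate network and compare with the target.
    for addr in ipaddress.ip_network(network):
        if hash_ip(str(addr)) == target:
            return str(addr)
    return None

leaked = hash_ip("192.0.2.77")
print(reverse_hash(leaked, "192.0.2.0/24"))
```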

atoav

3 months ago

Which is why we pepper and salt our hashes.

If you store the blood type of a patient hashed, the problem is that there are only so many blood types. So the same blood type will have the same hash value and attackers could (1) just infer statistically which are which, (2) crack one and get the rest and (3) group users even without cracking the hash.

That means we need to ensure the input values are getting more complex by prefixing them with secrets from elsewhere.

If you have one secret (e.g. stored in an environment variable), that would be the pepper. Because the pepper is not stored next to the hashed values, it makes attacks harder, but since it is the same for every value, it is not enough on its own.

A salt would be a per-value secret that is stored with each hashed blood type and prepended to the value before hashing.

The two in combination make it much harder to get from the hashed value to the input value without having both salt and pepper.
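
A minimal sketch of the salt-plus-pepper idea described above, assuming SHA-256 and simple prefixing. The `HASH_PEPPER` variable name, the fallback value, and the function names are made up for illustration; a keyed construction such as HMAC would be the more conventional choice in practice.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Pepper: one secret for the whole dataset, kept outside the database.
# The fallback string is for demonstration only.
PEPPER = os.environ.get("HASH_PEPPER", "demo-only-pepper").encode()

def hash_value(value: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    # Salt: random per record, stored next to the digest.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(PEPPER + salt + value.encode()).digest()
    return salt, digest

def matches(value: str, salt: bytes, digest: bytes) -> bool:
    # Recompute with the stored salt and compare in constant time.
    return hmac.compare_digest(hash_value(value, salt)[1], digest)

salt_a, digest_a = hash_value("A+")
salt_b, digest_b = hash_value("A+")
print(digest_a != digest_b)              # same blood type, different digests
print(matches("A+", salt_a, digest_a))
```

With per-record salts, equal blood types no longer produce equal digests, so grouping records by digest stops working. Anyone holding both the salt and the pepper can still try all the blood types for a given record.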

47282847

3 months ago

That’s encryption at rest, but not anonymization, unless you throw away the salt and pepper, at which point the record becomes meaningless since it cannot serve for future comparisons.

atoav

3 months ago

This can be anonymization, if you throw away the key. If you keep it, it is worse than encryption, since now attackers can also differentiate subnets.

quuxplusone

3 months ago

Right. In fact "data destruction" itself can be implemented as "encryption" plus "throwing-away-the-key" plus (importantly!) "throwing-away-the-plaintext." If you don't throw away the plaintext after encryption, you're really missing an important step. ;)

"IP anonymization" is kind of a subset of "data destruction." We want to destroy some of the information — like, "is this address 127.0.0.2?" — but we want to preserve some of it — like, "is this one address in the same /24 subnet as this other one?". That's because we want to be able to say things like, "50% of our traffic comes from a single /24. Its anonymized name in this dataset is 28.238.72.0/24; we can't tell you what its real name is because we anonymized that away."

If your threat model includes things like "We really want not to be able to say things like that about our dataset," then obviously you should not use (only) anonymization. Because the whole point of anonymization is precisely to preserve the ability to say things like that about subnet structure, while anonymizing away the real addresses.
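
For readers who want to see what "preserving subnet structure" means mechanically, here is a rough sketch of a prefix-preserving mapping. It follows the general shape of the Crypto-PAn construction (each output bit is the input bit XORed with a keyed pseudorandom function of the preceding bits), but it swaps in HMAC-SHA256 as the pseudorandom function, so it is not Crypto-PAn itself and will not interoperate with it. The function name and key are hypothetical.

```python
import hashlib
import hmac
import ipaddress

def pseudonymize_ipv4(ip: str, key: bytes) -> str:
    addr = int(ipaddress.IPv4Address(ip))
    out = 0
    for i in range(32):
        # The top i bits of the address (0 when i == 0).
        prefix = addr >> (32 - i)
        # The PRF input encodes the prefix length and the prefix itself.
        msg = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (addr >> (31 - i)) & 1
        # The flip decision depends only on the earlier bits, so two addresses
        # sharing a k-bit prefix map to outputs sharing a k-bit prefix.
        out = (out << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(out))

key = b"example-key-for-illustration-only"
print(pseudonymize_ipv4("192.0.2.1", key))
print(pseudonymize_ipv4("192.0.2.200", key))  # same first three octets as above
```

Throwing away `key` after processing the dataset makes the mapping impractical to invert directly, while the shared-prefix relationships between the pseudonymized addresses remain visible.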

Perhaps it should have been called "IP pseudonymization." I would have said that ship has sailed, but after googling "ip pseudonymization" it seems like maybe precise terminology is trying to make a comeback due to things like the GDPR.

https://portolano.it/en/newsletter/portolano-cavallo-inform-...

> In the General Court’s opinion [...] the identifiability of the data subject should be assessed taking into account the concrete possibilities of the third-party recipient to identify data subjects. As such, when sharing pseudonymous data, the same must be considered anonymous if the recipient has no means to re-identify data subjects.

> [S]ince the third-party recipient did not have access to the additional information capable of identifying the data subjects, nor could it in any way have acquired such access, the transmitted data should be considered anonymous data and not pseudonymous data.

waynesonfire

3 months ago

We would also truncate lat/lon coordinates.
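
A small sketch of coordinate truncation, assuming two decimal places are kept (roughly a kilometre of latitude). The function name and the precision are illustrative choices, not something stated in the thread.

```python
import math
from typing import Tuple

def truncate_coord(lat: float, lon: float, decimals: int = 2) -> Tuple[float, float]:
    # Drop digits beyond `decimals` places, truncating toward zero
    # rather than rounding.
    factor = 10 ** decimals
    return (math.trunc(lat * factor) / factor,
            math.trunc(lon * factor) / factor)

print(truncate_coord(48.858844, 2.294351))
```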

bashtoni

3 months ago

Can we get a tag for AI slop generated articles like this one?

If the author couldn't be bothered to write it, why would anyone think we should bother to read it?

Sophira

3 months ago

Why do you feel this was generated by AI?

IP address truncation fails at anonymization

10 Comments
