The Hidden Risk in HLLs

I’ve seen a pretty common misconception show up in analytics discussions over the last few years:

We use HLLs, so the data is anonymous.

That sounds reasonable at first. HyperLogLog (HLL) sketches store only compact state derived from hashed user identifiers, not the identifiers themselves. They’re probabilistic. They’re lossy. You can’t reconstruct the original dataset from them.

But that doesn’t actually make them privacy-safe. The important distinction is this:

HLLs are hard to reverse, but they are still very capable of leaking information.

In practice, the problem usually isn’t someone extracting user IDs from a sketch. The real problem is what happens when HLL-backed systems allow people to repeatedly query tiny cohorts. That’s where things get dangerous.

First: Can You Extract a User ID From an HLL?

In realistic systems, no. HLL sketches are fundamentally non-invertible. They’re designed for estimating cardinality, not storing identities.

An HLL hashes identifiers, keeps only partial information, intentionally throws away entropy, and stores compressed probabilistic state.
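To make that concrete, here is a toy sketch in Python. It is an illustration of the general idea, not how any particular HLL library is implemented; the class name `TinyHLL`, the use of SHA-256, and the precision `p=10` are all assumptions I picked for the example.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog for illustration only (not production code)."""

    def __init__(self, p: int = 10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m   # the entire stored state

    def add(self, item: str) -> None:
        # Hash the identifier to 64 bits. The raw ID is never stored.
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        # Only the maximum rank per register survives; everything else is discarded.
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw
```

Note what the state consists of: a fixed array of small integers. Note also that for tiny cohorts the estimate is close to exact, which matters for everything that follows.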

So if someone asks:

Can an attacker recover user_id=12345 from this HLL sketch?

The answer is generally no. That’s not the attack vector people should worry about. You are not decrypting an HLL back into users.

But I think this is exactly where teams develop a false sense of safety. Because while the sketch itself may not reveal identities directly, the query system around it absolutely can.

The Real Privacy Issue

The actual problem is what I’d call a micro-cohort singling-out attack.

Maybe there’s a more formal academic term depending on the exact setup. It overlaps with singling-out attacks, inference attacks, re-identification attacks, and differencing attacks.

But operationally, the pattern is simple: An attacker keeps slicing dimensions until the cohort size becomes tiny. Eventually they get to unique_users = 1.

And at that point, privacy is basically gone. Not because they extracted the user ID from the HLL, but because they isolated a human being statistically.

Here’s What This Looks Like in Practice

Imagine an analytics system with self-serve querying. A user can filter by org, browser, feature flag, country, timestamp, device type, rollout cohort, and experiment bucket.

Now imagine someone already knows a few details about a target. Maybe they know the person works at a specific company, uses Linux, enabled a beta feature, and logged in late at night.

The attacker starts narrowing queries:

Query	Result
Firefox users	400k
Firefox users on Linux	12k
Firefox users on Linux in Org X	7
Firefox users on Linux in Org X using Feature Y	1

That final query is the problem. At that point, the attacker has effectively confirmed that the person exists in the dataset and that they used the feature. Adding a timestamp filter on top would also show when they were active.

No raw identifiers required.
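The narrowing pattern is easy to reproduce. The sketch below uses a made-up five-user dataset and an exact set-based distinct count as a stand-in for an HLL-backed one (at cohort sizes this small, an HLL estimate would typically be the same or very close). The `Event` type, the field names, and the data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    browser: str
    os: str
    org: str
    feature: str


# Hypothetical data: five users, one event each.
EVENTS = [
    Event("u1", "firefox", "linux", "org_x", "feature_y"),
    Event("u2", "firefox", "linux", "org_x", "none"),
    Event("u3", "firefox", "linux", "org_z", "none"),
    Event("u4", "firefox", "windows", "org_x", "none"),
    Event("u5", "chrome", "linux", "org_x", "feature_y"),
]


def distinct_users(events, **filters) -> int:
    """Distinct-user count for events matching every filter (exact stand-in for an HLL)."""
    return len({
        e.user_id
        for e in events
        if all(getattr(e, field) == value for field, value in filters.items())
    })


# The attacker only ever sees counts, never user IDs.
steps = [
    {"browser": "firefox"},
    {"browser": "firefox", "os": "linux"},
    {"browser": "firefox", "os": "linux", "org": "org_x"},
    {"browser": "firefox", "os": "linux", "org": "org_x", "feature": "feature_y"},
]
counts = [distinct_users(EVENTS, **f) for f in steps]
print(counts)  # [4, 3, 2, 1]
```

The last count isolates one person using nothing but aggregate answers.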

Why Engineers Miss This

When data engineers and analytics engineers are used to working with extremely large datasets, it’s easy to forget how exposed a single individual’s data can be.

I think a lot of data systems accidentally treat hashing as synonymous with privacy. It isn’t.

Hashing protects against direct recovery of values. That’s useful. But privacy failures in analytics systems are usually about inference, not inversion. Those are completely different threat models.

You can build a perfectly non-invertible HLL implementation and still leak highly sensitive information if your query layer allows unrestricted narrowing.

I’ve seen teams focus heavily on salting hashes and securing sketches while simultaneously exposing dashboards that happily answer:

Is there exactly one person matching this filter?

That’s the actual vulnerability.

HLLs Make This Easier, Not Harder

Ironically, HLLs are so operationally convenient that they increase the risk surface. They make distinct counts cheap, fast, composable, and easy to expose in self-serve analytics.

So organizations end up enabling extremely granular slicing because the infrastructure can handle it. But once you support arbitrary dimensions and fast iteration, you’ve effectively created a statistical probing system.

An attacker does not need raw access to data anymore. They just need enough query flexibility.

Presence Disclosure Is Still Disclosure

One thing I think the industry still underestimates is how sensitive presence can be. People tend to think:

Well, we didn’t expose the actual user.

But even proving someone exists in a dataset can matter a lot. For example:

Did this employee access the system?
Did this journalist use the feature?
Did this customer participate in the beta?
Was this person online at this time?
Did someone from this company enable this flag?

Those are meaningful disclosures. And once cohorts become tiny enough, HLL-backed analytics can absolutely reveal them.

What Actually Fixes This: Privacy Engineering

The fix is not stronger hashing. The fix is controlling what the system will answer about small cohorts and adding statistical protections to the results. This is fundamentally a query-governance problem.

1. K-Anonymity

aka Minimum Cohort Thresholds

This is the biggest one. Never return counts for tiny populations. Common thresholds are 10, 25, or 50. If the estimated cardinality falls below the threshold, suppress it or return NULL.
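A minimal sketch of that rule in Python might look like the following. The function name `safe_count` and the threshold of 25 are assumptions for illustration; the right threshold depends on your data and risk tolerance.

```python
from typing import Optional

K_MIN = 25  # hypothetical minimum cohort size


def safe_count(estimate: float, k_min: int = K_MIN) -> Optional[int]:
    """Return the rounded estimate, or None (suppressed) for cohorts below k_min."""
    if estimate < k_min:
        return None
    return round(estimate)


print(safe_count(12000.4))  # 12000
print(safe_count(7.0))      # None
```

Because HLL estimates carry some error, a cohort sitting right at the threshold could really be slightly smaller, so it may be worth leaving some margin when choosing the value.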

2. Differencing Attack Prevention

Attackers usually succeed through repeated narrowing. You should think carefully about how many filters can be combined and whether differencing attacks (comparing two slightly different queries) are possible.
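Here is a small sketch of why a threshold alone is not enough, plus one simple guard. Everything in it is hypothetical: the names `differencing_leak` and `validate_filters`, the threshold of 25, and the cap of three filters. Capping filters only limits how far someone can narrow; it does not eliminate differencing.

```python
from typing import Optional

K_MIN = 25        # hypothetical minimum cohort size
MAX_FILTERS = 3   # hypothetical cap on combined filters


def thresholded(count: int, k_min: int = K_MIN) -> Optional[int]:
    """Release a count only if the cohort is at least k_min."""
    return count if count >= k_min else None


def differencing_leak(count_a: int, count_b: int, k_min: int = K_MIN) -> bool:
    """True if two releasable counts differ by a cohort smaller than k_min."""
    a, b = thresholded(count_a, k_min), thresholded(count_b, k_min)
    if a is None or b is None:
        return False  # at least one answer was suppressed, nothing to subtract
    return 0 < abs(a - b) < k_min


def validate_filters(filters: dict, max_filters: int = MAX_FILTERS) -> None:
    """Reject queries that combine more dimensions than the policy allows."""
    if len(filters) > max_filters:
        raise ValueError(f"too many filters: {len(filters)} > {max_filters}")


# "Org X users" = 40 and "Org X users NOT using Feature Y" = 39 both pass
# the threshold, but subtracting them isolates a single Feature Y user.
print(differencing_leak(40, 39))  # True
```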

3. Differential Privacy

aka Noise Injection

If you really care about privacy guarantees, eventually you end up in differential privacy territory. This means intentionally introducing uncertainty into results by injecting randomized noise.
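As a rough illustration, the classic approach for counts is the Laplace mechanism. The sketch below assumes each user changes a distinct count by at most 1 (sensitivity 1) and uses a hypothetical `noisy_count` helper. It ignores privacy-budget tracking across repeated queries and the known pitfalls of floating-point noise, so treat it as a teaching example rather than something to deploy.

```python
import random
from typing import Optional


def noisy_count(true_count: float, epsilon: float,
                rng: Optional[random.Random] = None) -> float:
    """Add Laplace noise with scale 1/epsilon (sensitivity assumed to be 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Smaller epsilon means more noise and a stronger guarantee.
print(noisy_count(1, epsilon=0.5))
```

The approximation error an HLL already has is not a substitute for this: it is not calibrated to any privacy guarantee, and for tiny cohorts it is close to zero.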

4. Generalization

aka Handling High-Cardinality Dimensions

The fastest path to singling someone out is through exact timestamps or org IDs. Generalization (e.g., rounding timestamps to the hour or bucketing IDs) helps reduce the risk of re-identification.
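Two small helpers show what generalization can look like before data ever reaches the sketch. The names `truncate_to_hour` and `bucket_org`, the `"other"` label, and the threshold of 25 are assumptions for the example.

```python
from datetime import datetime


def truncate_to_hour(ts: datetime) -> datetime:
    """Drop minutes, seconds, and microseconds so events share hourly buckets."""
    return ts.replace(minute=0, second=0, microsecond=0)


def bucket_org(org_id: str, org_user_counts: dict, k_min: int = 25) -> str:
    """Keep an org ID only if the org is large enough; otherwise fold it into 'other'."""
    return org_id if org_user_counts.get(org_id, 0) >= k_min else "other"


sizes = {"org_x": 7, "org_big": 4000}  # hypothetical user counts per org
print(truncate_to_hour(datetime(2024, 3, 5, 23, 47, 12)))  # 2024-03-05 23:00:00
print(bucket_org("org_x", sizes))    # other
print(bucket_org("org_big", sizes))  # org_big
```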

Final Thought

I don’t think HLLs are a privacy failure. They’re excellent at what they were designed for. The problem is when organizations mistake aggregation for anonymity or hashing for privacy.

An HLL sketch may not reveal a user directly, but a sufficiently flexible analytics system can reveal whether a single person exists inside a cohort. And from a privacy perspective, that is often enough to matter.

Key Definitions for Clarity

Anonymization: Irreversibly removing or transforming identifiers such that an individual can no longer be identified, directly or indirectly, using any reasonably available means.

De-identification: Removing or masking direct identifiers from data, while still acknowledging that re-identification may be possible when combined with other datasets or contextual information.

Pseudonymization: Replacing identifiable fields with consistent substitutes like hashes or tokens, where the original identity can still potentially be recovered through additional mapping information or linkage.

Tokenization: Replacing sensitive values with non-sensitive placeholder tokens, typically backed by a secure lookup system that can restore the original value when needed.

Aggregation: Combining individual-level data into grouped summaries or statistics, such as counts, averages, or distributions, to reduce direct exposure of raw records.

Privacy-preserving: A broad term describing systems or techniques designed to minimize disclosure risk, though it does not imply formal or provable privacy guarantees on its own.

Note: I used some AI help for the markdown, tables, grammar, and making this post actually make sense.