Back
4 min read

The HLL 100% Trap

Analytics Data Modeling Data Engineering

Telemetry datasets are often optimized to save costs. This is usually great, but it can lead to confusing results for analysts. In this post, I will explain a not-so-common problem called the HLL 100% Trap.

The Scenario

Imagine you are querying a table called daily_category_failures. This table tracks device telemetry. To save money, the data is pre-aggregated using HyperLogLog (HLL) sketches.

The table has one row per country and category. It uses two main metrics:

  • category_uniques_hll: Unique devices that failed in a specific category.
  • total_failing_uniques_hll: All unique devices that had any failure in that country.

When you look at one category, like Network_Drop, the failure rate looks normal. It might be around 20%. But if you remove the category filter to see a global view, the result jumps to exactly 100%.

“When working with HyperLogLog, you are using sets, not numbers. If you forget this, the data will give you the wrong answer.”

The Mental Model: The Attendance Sheet

To understand this, think of a school with three classrooms. If you count the students in each room and add them up, you get the total seats filled.

But HLL works like an attendance sheet. If a student visits all three rooms, they are only on the sheet once. The “100% trap” happens when you divide the sheet of “students in a specific class” by the sheet of “all students in the school.” Eventually, they become the same list.

The Technical Core: Set Union vs. Summation

The problem comes from how HLL sketches combine data.

Standard counts add numbers together (10 + 15 = 25). HLL is different. It uses a Set Union. If a device has a network failure and a battery failure, the merge algorithm counts it only once.

Why does it hit 100%?

The denominator in our table is “Total failing devices.” If you merge the category failures across every category, you are asking for a list of all devices that failed in Category A, OR Category B, OR Category C.

This list is the same as the total set of all failed devices. The top number and the bottom number become identical.

SetASetBSetC=Total Set\text{Set}_A \cup \text{Set}_B \cup \text{Set}_C = \text{Total Set}

Visualization: The Union in Action

Let’s look at a sample of 5 unique devices across 3 categories:

Device IDCategory ACategory BCategory CDid the Device Fail?
Device 1FailsYes
Device 2FailsYes
Device 3FailsYes
Device 4FailsFailsYes
Device 5FailsFailsYes

If we filter by Category A, the numerator is 2 {Device 1, 4} and the denominator is 5. The result is 40%.

If we remove the filter and merge everything, the numerator becomes the union of all sets: {1,4}{2,4,5}{3,5}={1,2,3,4,5}\{1, 4\} \cup \{2, 4, 5\} \cup \{3, 5\} = \{1, 2, 3, 4, 5\}. We divide 5 by 5, and we get 100%.

Technical Implementation: The Two Scenarios

Here is how this looks in practice with a sample of data for one country.

Sample Unaggregated Data

Each row contains an HLL sketch (a set of device IDs).

countrycategorycategory_uniques_hll (Numerator)total_failing_uniques_hll (Denominator)
USANetwork{Dev 1, 4}{Dev 1, 2, 3, 4, 5}
USABattery{Dev 2, 4, 5}{Dev 1, 2, 3, 4, 5}
USAApps{Dev 3, 5}{Dev 1, 2, 3, 4, 5}

1. The Trap (The “Wrong” Way)

This query removes the category filter but keeps the HLL merge.

SELECT country, HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(category_uniques_hll)) / HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(total_failing_uniques_hll)) AS failure_rateFROM daily_category_failuresGROUP BY country

How the aggregation happens:

StepCalculationResulting SetCount
1. Merge Numerators{1, 4} ∪ {2, 4, 5} ∪ {3, 5}{1, 2, 3, 4, 5}5
2. Merge Denominators{1, 2, 3, 4, 5} ∪ …{1, 2, 3, 4, 5}5
Final Result5 / 5-100%

2. The Filtered Fix (The “Right” Way)

To see the failure rate for a specific category, you must filter for it.

SELECT country, HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(category_uniques_hll)) / HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(total_failing_uniques_hll)) AS failure_rateFROM daily_category_failuresWHERE category = 'Network'GROUP BY country

How the aggregation happens:

StepCalculationResulting SetCount
1. Filter CategorySelect only ‘Network’ row{1, 4}2
2. Merge DenominatorStill the full failing set{1, 2, 3, 4, 5}5
Final Result2 / 5-40%

Conclusion

HyperLogLog is a great tool for performance. It saves space and makes queries faster. But it requires a different way of thinking. When you use HLL, you are working with sets, not simple numbers.

The “100% Trap” happens when the order of operations is wrong. If you match your filters to your data, you can avoid the trap and get accurate results.

Note: I used some AI help for the markdown, tables, grammar, and making this post actually make sense.