The HLL 100% Trap
Telemetry datasets are often optimized to save costs. This is usually great, but it can lead to confusing results for analysts. In this post, I will explain a not-so-common problem called the HLL 100% Trap.
The Scenario
Imagine you are querying a table called daily_category_failures. This table tracks device telemetry. To save money, the data is pre-aggregated using HyperLogLog (HLL) sketches.
The table has one row per country and category. It uses two main metrics:
category_uniques_hll: Unique devices that failed in a specific category.total_failing_uniques_hll: All unique devices that had any failure in that country.
When you look at one category, like Network_Drop, the failure rate looks normal. It might be around 20%. But if you remove the category filter to see a global view, the result jumps to exactly 100%.
“When working with HyperLogLog, you are using sets, not numbers. If you forget this, the data will give you the wrong answer.”
The Mental Model: The Attendance Sheet
To understand this, think of a school with three classrooms. If you count the students in each room and add them up, you get the total seats filled.
But HLL works like an attendance sheet. If a student visits all three rooms, they are only on the sheet once. The “100% trap” happens when you divide the sheet of “students in a specific class” by the sheet of “all students in the school.” Eventually, they become the same list.
The Technical Core: Set Union vs. Summation
The problem comes from how HLL sketches combine data.
Standard counts add numbers together (10 + 15 = 25). HLL is different. It uses a Set Union. If a device has a network failure and a battery failure, the merge algorithm counts it only once.
Why does it hit 100%?
The denominator in our table is “Total failing devices.” If you merge the category failures across every category, you are asking for a list of all devices that failed in Category A, OR Category B, OR Category C.
This list is the same as the total set of all failed devices. The top number and the bottom number become identical.
Visualization: The Union in Action
Let’s look at a sample of 5 unique devices across 3 categories:
| Device ID | Category A | Category B | Category C | Did the Device Fail? |
|---|---|---|---|---|
| Device 1 | Fails | Yes | ||
| Device 2 | Fails | Yes | ||
| Device 3 | Fails | Yes | ||
| Device 4 | Fails | Fails | Yes | |
| Device 5 | Fails | Fails | Yes |
If we filter by Category A, the numerator is 2 {Device 1, 4} and the denominator is 5. The result is 40%.
If we remove the filter and merge everything, the numerator becomes the union of all sets: . We divide 5 by 5, and we get 100%.
Technical Implementation: The Two Scenarios
Here is how this looks in practice with a sample of data for one country.
Sample Unaggregated Data
Each row contains an HLL sketch (a set of device IDs).
| country | category | category_uniques_hll (Numerator) | total_failing_uniques_hll (Denominator) |
|---|---|---|---|
| USA | Network | {Dev 1, 4} | {Dev 1, 2, 3, 4, 5} |
| USA | Battery | {Dev 2, 4, 5} | {Dev 1, 2, 3, 4, 5} |
| USA | Apps | {Dev 3, 5} | {Dev 1, 2, 3, 4, 5} |
1. The Trap (The “Wrong” Way)
This query removes the category filter but keeps the HLL merge.
SELECT country, HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(category_uniques_hll)) / HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(total_failing_uniques_hll)) AS failure_rateFROM daily_category_failuresGROUP BY countryHow the aggregation happens:
| Step | Calculation | Resulting Set | Count |
|---|---|---|---|
| 1. Merge Numerators | {1, 4} ∪ {2, 4, 5} ∪ {3, 5} | {1, 2, 3, 4, 5} | 5 |
| 2. Merge Denominators | {1, 2, 3, 4, 5} ∪ … | {1, 2, 3, 4, 5} | 5 |
| Final Result | 5 / 5 | - | 100% |
2. The Filtered Fix (The “Right” Way)
To see the failure rate for a specific category, you must filter for it.
SELECT country, HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(category_uniques_hll)) / HLL_COUNT.EXTRACT(HLL_COUNT.MERGE(total_failing_uniques_hll)) AS failure_rateFROM daily_category_failuresWHERE category = 'Network'GROUP BY countryHow the aggregation happens:
| Step | Calculation | Resulting Set | Count |
|---|---|---|---|
| 1. Filter Category | Select only ‘Network’ row | {1, 4} | 2 |
| 2. Merge Denominator | Still the full failing set | {1, 2, 3, 4, 5} | 5 |
| Final Result | 2 / 5 | - | 40% |
Conclusion
HyperLogLog is a great tool for performance. It saves space and makes queries faster. But it requires a different way of thinking. When you use HLL, you are working with sets, not simple numbers.
The “100% Trap” happens when the order of operations is wrong. If you match your filters to your data, you can avoid the trap and get accurate results.