Commit 7fa9309
feat: use cardinality estimator for distinct count stats
Replace the exact `HashMap`/`HashSet` previously used to compute
distinct-value counts during compression stats generation with
Cloudflare's `cardinality-estimator` crate. The estimator gives us a
bounded-memory approximation (exact up to ~128 distinct values, then
HyperLogLog++) so high-cardinality arrays no longer require an O(n)
auxiliary hash table to answer the single question "how many unique
values does this have?".
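
To make the "exact first, then approximate" behaviour concrete, here is a toy sketch using only the standard library. It is not the `cardinality-estimator` crate's implementation or API: `ToyEstimator`, `EXACT_LIMIT`, `P`, and the helper functions are hypothetical names, the exact phase stores 64-bit hashes rather than values, and the approximate phase is plain HyperLogLog with a linear-counting correction rather than HyperLogLog++.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Index bits for the sketch: 2^12 = 4096 one-byte registers (assumed value).
const P: u32 = 12;
const M: usize = 1 << P;
/// Distinct hashes tracked exactly before switching to the sketch (assumed value).
const EXACT_LIMIT: usize = 128;

/// Hypothetical hybrid estimator: exact while small, HyperLogLog afterwards.
enum ToyEstimator {
    Exact(HashSet<u64>),
    Sketch(Vec<u8>),
}

fn hash_of<T: Hash>(item: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    item.hash(&mut hasher);
    hasher.finish()
}

/// Top P bits pick a register; the register keeps the largest "rank"
/// (leading zeros of the remaining bits, plus one) seen so far.
fn set_register(regs: &mut [u8], hash: u64) {
    let idx = (hash >> (64 - P)) as usize;
    let rest = hash << P;
    let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
    regs[idx] = regs[idx].max(rank);
}

impl ToyEstimator {
    fn new() -> Self {
        ToyEstimator::Exact(HashSet::new())
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let hash = hash_of(item);
        let promoted = match self {
            ToyEstimator::Exact(seen) => {
                seen.insert(hash);
                if seen.len() > EXACT_LIMIT {
                    // Too many distinct hashes: replay them into fixed-size registers.
                    let mut regs = vec![0u8; M];
                    for &h in seen.iter() {
                        set_register(&mut regs, h);
                    }
                    Some(regs)
                } else {
                    None
                }
            }
            ToyEstimator::Sketch(regs) => {
                set_register(regs, hash);
                None
            }
        };
        if let Some(regs) = promoted {
            *self = ToyEstimator::Sketch(regs);
        }
    }

    fn estimate(&self) -> usize {
        match self {
            ToyEstimator::Exact(seen) => seen.len(),
            ToyEstimator::Sketch(regs) => {
                let m = M as f64;
                let alpha = 0.7213 / (1.0 + 1.079 / m);
                let sum: f64 = regs.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
                let raw = alpha * m * m / sum;
                let zeros = regs.iter().filter(|&&r| r == 0).count();
                if raw <= 2.5 * m && zeros > 0 {
                    // Small-range correction: linear counting on empty registers.
                    (m * (m / zeros as f64).ln()).round() as usize
                } else {
                    raw.round() as usize
                }
            }
        }
    }
}
```

Once promoted, the sketch occupies a fixed 4096 bytes however many values are inserted, which is the bounded-memory property the commit relies on. The cost is that the values themselves are no longer recoverable from the stats.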
- Integer stats swap the hash map for a `CardinalityEstimator` and
  track the most frequent value via a Boyer-Moore majority candidate
  plus a second-pass exact count. Sparse/dict schemes only care about
  the heavy hitter (>= 90% threshold) or a rough distinct ratio, so
  this is behaviourally equivalent for the decisions they make.
- Float and string stats likewise drop their hash sets in favor of the
  estimator.
- The integer and float dictionary encoders now rebuild the exact set
  of distinct values from the source array at compress time, since
  they need the values themselves and the stats layer no longer
  retains them.
- `SequenceScheme`'s fast-path check for "all values are distinct" now
  tolerates the estimator's small approximation error; the deferred
  callback still validates sequences exactly.
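
The heavy-hitter tracking and the tolerant distinctness check described in the list above can be sketched as follows. This is a minimal illustration, not the code in this commit: the function names are hypothetical, and the tolerance passed to `roughly_all_distinct` is an assumed parameter, since the message does not state the margin actually used.

```rust
/// Boyer-Moore majority vote: one pass, O(1) extra memory.
/// The result is the only value that *could* occupy more than half of
/// `values`; it is a candidate, not a confirmed majority.
fn majority_candidate<T: PartialEq + Copy>(values: &[T]) -> Option<T> {
    let mut candidate = None;
    let mut votes = 0usize;
    for &v in values {
        if votes == 0 {
            candidate = Some(v);
            votes = 1;
        } else if candidate == Some(v) {
            votes += 1;
        } else {
            votes -= 1;
        }
    }
    candidate
}

/// Second pass: count the candidate exactly and keep it only if it reaches
/// `threshold` (a fraction above 0.5, e.g. 0.9 for the heavy-hitter rule).
fn top_value_if_dominant<T: PartialEq + Copy>(
    values: &[T],
    threshold: f64,
) -> Option<(T, usize)> {
    let candidate = majority_candidate(values)?;
    let count = values.iter().filter(|&&v| v == candidate).count();
    (count as f64 >= threshold * values.len() as f64).then_some((candidate, count))
}

/// Fast-path style check that accepts an approximate distinct count.
/// `tolerance` is a hypothetical knob, e.g. 0.05 to allow a 5% undercount.
fn roughly_all_distinct(estimated_distinct: usize, len: usize, tolerance: f64) -> bool {
    estimated_distinct as f64 >= (1.0 - tolerance) * len as f64
}
```

Any value covering at least 90% of the array is also a strict majority, so the vote is guaranteed to surface it and the second pass only has to confirm the count. Values below 50% may be missed entirely, which is acceptable only because the schemes ignore them anyway.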
Signed-off-by: Robert Kruszewski <github@robertk.io>
1 parent 5e5572b commit 7fa9309
42 files changed
Lines changed: 840 additions & 703 deletions
File tree
- encodings
- fastlanes/src/rle/array
- runend/src
- vortex-array
- benches
- src
- arrays
- dict
- compute
- primitive/array
- varbin/vtable
- builders/dict
- scalar
- vortex-btrblocks/src/schemes
- vortex-buffer
- src
- vortex-compressor
- benches
- src
- builtins
- dict
- stats
- vortex-ffi/src
- vortex-layout/src/layouts/dict
- vortex-test/compat-gen/src/fixtures/arrays/synthetic/encodings
- vortex/benches