Making a 16-second query run in 8 ms
Recently I was writing a recommendation system, as I started including more things in my index it got surprisingly slow. I found out that one of the operations it was doing can hit a wall way faster than I would expect .
I was ranking ~5.7k small items(48k feature space) on a 4 GB GCP box: score each item against the user, return the top matches. Postgres took seconds, and got worse as the index grew. Luckily I managed to eventually solve that part. At the ~5.7k index (1×) and 10× it (~56k):
| warm 1× | warm 10× | cold 1× | |
|---|---|---|---|
| baseline | 1,376 ms | 14,888 ms | 16.1 s |
| final version | 8.45 ms | 90.6 ms | — |
last thing to mention: Precomputing is out the user's set changes every request, so the ranking runs live each time.
Why it's slow
Each item is stored as a JSON object, a map of feature → weight, a few hundred entries out of a
~48k-feature space. The query pulls the weights back out of those objects on each request, which
expands to ~2.2M (item, feature) rows that then get aggregated against the user's set:
-- explode each item's object, join the user's set, sum the matched weights
SELECT c.item_id, sum(f.weight::int) AS covered
FROM items c, jsonb_each_text(c.features) AS f(feature, weight)
JOIN user_set s ON s.feature = f.feature::int
GROUP BY c.item_id;
TOAST1 I/O is the obvious suspect. Those JSON objects are big enough to live out of line, so reading the index looks like a wall of random fetches.
Warm, it's CPU
With the index resident it still takes 1.4 s, and that goes into manufacturing and hash-joining the ~2.2M exploded rows, not into JSON parsing.
Cold, it's I/O
TOAST was right there: forced cold, the query is 16 s, dominated by random TOAST reads off the network disk.
It won't stay warm
| regime | when | cost | cause |
|---|---|---|---|
| cold | after a deploy, or when the pages are evicted | ~16 s | random TOAST detoast I/O |
| warm | resident in cache | 1.4 s | CPU explode + hash join |
The whole database doesn't fit in RAM, so unrelated traffic keeps evicting the feed's pages: open one item and the index falls out of cache, and a deploy clears it entirely. Today that's a per-request swing from 1.4 s to 16 s. The worry is the trajectory: as the database outgrows RAM the box spends a growing share of its time evicting and re-reading pages instead of serving, and under load that tips into thrashing. A bigger box buys back the cold half: with enough RAM the corpus stays resident and the cold path never happens, so cold is really a small-box artifact. The warm seconds barely move, though, and they're the real problem.
It isn't the representation
Three other shapes score the same batch:
-- int[]: unnest two parallel arrays (no JSON anywhere)
SELECT c.item_id, sum(f.weight) AS covered
FROM items c, unnest(c.feature_ids, c.weights) AS f(feature, weight)
JOIN user_set s ON s.feature = f.feature
GROUP BY c.item_id;
-- normalized: the (item, feature, weight) rows already exist on disk
SELECT f.item_id, sum(f.weight) AS covered
FROM item_features f JOIN user_set s ON s.feature = f.feature
WHERE f.item_id = ANY(:items)
GROUP BY f.item_id;
-- sparsevec: no rows at all, one C kernel per item
SELECT c.item_id, -(c.features <#> :user_set) AS covered
FROM items c;
int[] and the normalized table are the jsonb query again, the same join and sum with no JSON
anywhere; only the row source changes. Where each lands, at 1× (~5.7k items) and 10× it (~56k),
warm median on the non-burstable 4 GB box2:
| representation | warm 1× | warm 10× | cold 1× | note |
|---|---|---|---|---|
| jsonb (no-fence) | 1,376 ms | 14,888 ms | 16.1 s | baseline |
| int[] columns | 1,922 ms | — | — | |
| normalized (clustered) | 1,718 ms | — | 3.7 s | 4× cold, still seconds warm |
| pgvector sparsevec | 447 ms | 4,699 ms | 0.6 s | best SQL can do, still 4.7 s @ 10× |
The same ~2.2M rows get manufactured and aggregated every way.3 Only sparsevec changes the
shape, running the overlap as a C kernel instead of a row explode. That
buys a stable ~3× over jsonb, but it's still 4.7 s at 10×. It can't random-access either, so it walks
the whole ~14k user set per item; the bitset later deletes that walk.
It's a sparse matrix-vector product
Every representation lands in the same place because the work isn't relational. Squint
and it's arithmetic: line the items up as the rows of a sparse matrix D (D[i][f] = weight of
feature f in item i, ~386 of ~48k nonzero), the user's set as a 0/1 vector k, and the whole
ranking is one matrix-vector product4:
covered = D · k # one covered-weight per item
Its cost is one multiply-add per nonzero of D. A relational plan materializes every (item, feature) pair as a tuple instead, ~2.2M per request, then hash-joins and groups to recover the same
sum. What the SQL executor actually runs:
for each item:
for (feature, weight) in item.features: # a tuple per pair, ~2.2M rows total
emit (item_id, feature, weight) # materialized into a tuplestore
hash_join(those rows, user_set)
group by item_id: sum(weight) where feature ∈ user_set
The same computation as a plain loop in app RAM, no database involved:
user = bitset(feature_ids) # built once from the user's set; membership is O(1)
for each item: # the matrix is resident, int-encoded
covered = 0
for (feature_id, weight) in item: # ~386 ids, a tight array loop
if user.test(feature_id): covered += weight # no tuples, no join, no allocation
Nothing about that loop needs a database: hold the int-encoded matrix in RAM as contiguous
(feature_id, weight) arrays, make k a dense bitset5, and each item is a walk over its ~386
nonzeros, one O(1) bit test each. That's 8 ms6 at the index's scale and 90 ms at 10×.
The matrix-vector shape isn't the whole win. The explode touches roughly the same ~2.2M nonzeros the
loop does, so a lot of the gap over SQL is the cost per step, not their number: the bitset is ~6 KB
and stays in cache, each item's ids and weights are contiguous and prefetch-friendly, and the loop
allocates nothing, so a nonzero is a few cache-hot instructions. At that same step the SQL executor
manufactures a tuple (heap_form_minimal_tuple), writes it to a tuplestore, and probes the join hash,
none of it as local as a contiguous array walk.
It also beats sparsevec, the one SQL option that runs a real kernel instead of the row executor:
| per-item cost | ops/item | total over 5,668 items | |
|---|---|---|---|
| in-app bitset | O(item nonzeros) | ~386 | ~2.2M |
sparsevec <#> |
O(item + set nonzeros) | ~14k | ~82M |
pgvector stores a sparsevec as a sorted list of (feature_id, weight) pairs, and <#>, the inner
product, is a two-pointer merge over two of them: walk both lists, multiply where the indices match,
advance the smaller otherwise. The merge can't jump to an index, so it walks the whole ~14k user set
for every item, even though only ~386 entries can possibly overlap. The bitset's O(1) membership
deletes that walk: ~37× less work per item. Only k goes dense (~6 KB); a dense D would be
~10 GB at 100k items, so the matrix stays sparse.
In the same columns as the database table:
| representation | warm 1× | warm 10× | cold 1× | note |
|---|---|---|---|---|
| in-app bitset | 8.45 ms | 90.6 ms | — | never cold per request; the corpus loads once at boot (~1.25 s), then resident |
The query never needed tuning
No index or typed column helps, because the work is a matrix-vector product and a row-at-a-time executor is the wrong place to run one. In app RAM the matrix loads once at boot, off the request path, and each request walks contiguous int arrays in a few milliseconds, with no cold path.
- TOAST is how Postgres stores a field too big to keep inline: past ~2 KB the value is compressed and moved to a side table and fetched back on read, so the row still fits its 8 KB page. ↩
- Non-burstable
n2-custom-2-4096, 4 GB / 2 vCPU, network HDD; median-of-5 warm and median-of-3 cold; every table indexed as in prod; a differential oracle checks all representations agree on each item's score. The first cut ran on a burstablee2-small, which throttles once it burns its CPU credits mid-run: the tell was the int[] tier's warm run coming in slower than its cold run, which is impossible. Re-running non-burstable fixed it. ↩ int[]only sheds JSON parsing and string-key hashing, not the tuple-manufacturing that dominates:unnest's SRF hits the sameheap_form_minimal_tuplepath. The 1,922 ms reads as slower than the 1,376 ms baseline only because that baseline is jsonb with itsMATERIALIZEDfence removed; fenced like-for-like the two are even (jsonb 2,004 ms,int[]1,922 ms), so it's the row source that's a wash, not a regression. Normalizing pre-explodes to disk: 4× faster cold, the same seconds warm, ~14 GB at the target. An index gives membership, not the per-element weights the score needs, and the gather is already ~3 ms. ↩- If the score were just the number of shared features it'd be a set-intersection size. It's a weighted sum, each shared feature contributing its weight, which is exactly D · k. Membership alone isn't enough; you also need the weight. ↩
- Why exact, and not a bloom filter? The set is tiny: ~48k bits is 6 KB, small enough to sit in cache, so there's nothing to save with a probabilistic structure, and a bloom filter's false positives would over-count the overlap. Exact O(1) membership, no downside. ↩
- The 8 ms is the scoring kernel itself, not the ranking endpoint's end-to-end latency; the endpoint does more work around it, so its latency is higher. ↩