Making a 16-second query run in 8 ms

Recently I was writing a recommendation system, as I started including more things in my index it got surprisingly slow. I found out that one of the operations it was doing can hit a wall way faster than I would expect .

I was ranking ~5.7k small items(48k feature space) on a 4 GB GCP box: score each item against the user, return the top matches. Postgres took seconds, and got worse as the index grew. Luckily I managed to eventually solve that part. At the ~5.7k index (1×) and 10× it (~56k):

warm 1× warm 10× cold 1×
baseline 1,376 ms 14,888 ms 16.1 s
final version 8.45 ms 90.6 ms

last thing to mention: Precomputing is out the user's set changes every request, so the ranking runs live each time.

Why it's slow

Each item is stored as a JSON object, a map of feature → weight, a few hundred entries out of a ~48k-feature space. The query pulls the weights back out of those objects on each request, which expands to ~2.2M (item, feature) rows that then get aggregated against the user's set:

-- explode each item's object, join the user's set, sum the matched weights
SELECT c.item_id, sum(f.weight::int) AS covered
FROM items c, jsonb_each_text(c.features) AS f(feature, weight)
JOIN user_set s ON s.feature = f.feature::int
GROUP BY c.item_id;

TOAST1 I/O is the obvious suspect. Those JSON objects are big enough to live out of line, so reading the index looks like a wall of random fetches.

Warm, it's CPU

With the index resident it still takes 1.4 s, and that goes into manufacturing and hash-joining the ~2.2M exploded rows, not into JSON parsing.

Cold, it's I/O

TOAST was right there: forced cold, the query is 16 s, dominated by random TOAST reads off the network disk.

It won't stay warm

regime when cost cause
cold after a deploy, or when the pages are evicted ~16 s random TOAST detoast I/O
warm resident in cache 1.4 s CPU explode + hash join

The whole database doesn't fit in RAM, so unrelated traffic keeps evicting the feed's pages: open one item and the index falls out of cache, and a deploy clears it entirely. Today that's a per-request swing from 1.4 s to 16 s. The worry is the trajectory: as the database outgrows RAM the box spends a growing share of its time evicting and re-reading pages instead of serving, and under load that tips into thrashing. A bigger box buys back the cold half: with enough RAM the corpus stays resident and the cold path never happens, so cold is really a small-box artifact. The warm seconds barely move, though, and they're the real problem.

It isn't the representation

Three other shapes score the same batch:

-- int[]: unnest two parallel arrays (no JSON anywhere)
SELECT c.item_id, sum(f.weight) AS covered
FROM items c, unnest(c.feature_ids, c.weights) AS f(feature, weight)
JOIN user_set s ON s.feature = f.feature
GROUP BY c.item_id;

-- normalized: the (item, feature, weight) rows already exist on disk
SELECT f.item_id, sum(f.weight) AS covered
FROM item_features f JOIN user_set s ON s.feature = f.feature
WHERE f.item_id = ANY(:items)
GROUP BY f.item_id;

-- sparsevec: no rows at all, one C kernel per item
SELECT c.item_id, -(c.features <#> :user_set) AS covered
FROM items c;

int[] and the normalized table are the jsonb query again, the same join and sum with no JSON anywhere; only the row source changes. Where each lands, at 1× (~5.7k items) and 10× it (~56k), warm median on the non-burstable 4 GB box2:

representation warm 1× warm 10× cold 1× note
jsonb (no-fence) 1,376 ms 14,888 ms 16.1 s baseline
int[] columns 1,922 ms
normalized (clustered) 1,718 ms 3.7 s 4× cold, still seconds warm
pgvector sparsevec 447 ms 4,699 ms 0.6 s best SQL can do, still 4.7 s @ 10×

The same ~2.2M rows get manufactured and aggregated every way.3 Only sparsevec changes the shape, running the overlap as a C kernel instead of a row explode. That buys a stable ~3× over jsonb, but it's still 4.7 s at 10×. It can't random-access either, so it walks the whole ~14k user set per item; the bitset later deletes that walk.

It's a sparse matrix-vector product

Every representation lands in the same place because the work isn't relational. Squint and it's arithmetic: line the items up as the rows of a sparse matrix D (D[i][f] = weight of feature f in item i, ~386 of ~48k nonzero), the user's set as a 0/1 vector k, and the whole ranking is one matrix-vector product4:

covered = D · k            # one covered-weight per item

Its cost is one multiply-add per nonzero of D. A relational plan materializes every (item, feature) pair as a tuple instead, ~2.2M per request, then hash-joins and groups to recover the same sum. What the SQL executor actually runs:

for each item:
    for (feature, weight) in item.features:    # a tuple per pair, ~2.2M rows total
        emit (item_id, feature, weight)        #   materialized into a tuplestore
hash_join(those rows, user_set)
group by item_id: sum(weight) where feature ∈ user_set

The same computation as a plain loop in app RAM, no database involved:

user = bitset(feature_ids)                     # built once from the user's set; membership is O(1)
for each item:                                 # the matrix is resident, int-encoded
    covered = 0
    for (feature_id, weight) in item:           # ~386 ids, a tight array loop
        if user.test(feature_id): covered += weight   # no tuples, no join, no allocation

Nothing about that loop needs a database: hold the int-encoded matrix in RAM as contiguous (feature_id, weight) arrays, make k a dense bitset5, and each item is a walk over its ~386 nonzeros, one O(1) bit test each. That's 8 ms6 at the index's scale and 90 ms at 10×.

The matrix-vector shape isn't the whole win. The explode touches roughly the same ~2.2M nonzeros the loop does, so a lot of the gap over SQL is the cost per step, not their number: the bitset is ~6 KB and stays in cache, each item's ids and weights are contiguous and prefetch-friendly, and the loop allocates nothing, so a nonzero is a few cache-hot instructions. At that same step the SQL executor manufactures a tuple (heap_form_minimal_tuple), writes it to a tuplestore, and probes the join hash, none of it as local as a contiguous array walk.

It also beats sparsevec, the one SQL option that runs a real kernel instead of the row executor:

per-item cost ops/item total over 5,668 items
in-app bitset O(item nonzeros) ~386 ~2.2M
sparsevec <#> O(item + set nonzeros) ~14k ~82M

pgvector stores a sparsevec as a sorted list of (feature_id, weight) pairs, and <#>, the inner product, is a two-pointer merge over two of them: walk both lists, multiply where the indices match, advance the smaller otherwise. The merge can't jump to an index, so it walks the whole ~14k user set for every item, even though only ~386 entries can possibly overlap. The bitset's O(1) membership deletes that walk: ~37× less work per item. Only k goes dense (~6 KB); a dense D would be ~10 GB at 100k items, so the matrix stays sparse.

In the same columns as the database table:

representation warm 1× warm 10× cold 1× note
in-app bitset 8.45 ms 90.6 ms never cold per request; the corpus loads once at boot (~1.25 s), then resident

The query never needed tuning

No index or typed column helps, because the work is a matrix-vector product and a row-at-a-time executor is the wrong place to run one. In app RAM the matrix loads once at boot, off the request path, and each request walks contiguous int arrays in a few milliseconds, with no cold path.


  1. TOAST is how Postgres stores a field too big to keep inline: past ~2 KB the value is compressed and moved to a side table and fetched back on read, so the row still fits its 8 KB page.
  2. Non-burstable n2-custom-2-4096, 4 GB / 2 vCPU, network HDD; median-of-5 warm and median-of-3 cold; every table indexed as in prod; a differential oracle checks all representations agree on each item's score. The first cut ran on a burstable e2-small, which throttles once it burns its CPU credits mid-run: the tell was the int[] tier's warm run coming in slower than its cold run, which is impossible. Re-running non-burstable fixed it.
  3. int[] only sheds JSON parsing and string-key hashing, not the tuple-manufacturing that dominates: unnest's SRF hits the same heap_form_minimal_tuple path. The 1,922 ms reads as slower than the 1,376 ms baseline only because that baseline is jsonb with its MATERIALIZED fence removed; fenced like-for-like the two are even (jsonb 2,004 ms, int[] 1,922 ms), so it's the row source that's a wash, not a regression. Normalizing pre-explodes to disk: 4× faster cold, the same seconds warm, ~14 GB at the target. An index gives membership, not the per-element weights the score needs, and the gather is already ~3 ms.
  4. If the score were just the number of shared features it'd be a set-intersection size. It's a weighted sum, each shared feature contributing its weight, which is exactly D · k. Membership alone isn't enough; you also need the weight.
  5. Why exact, and not a bloom filter? The set is tiny: ~48k bits is 6 KB, small enough to sit in cache, so there's nothing to save with a probabilistic structure, and a bloom filter's false positives would over-count the overlap. Exact O(1) membership, no downside.
  6. The 8 ms is the scoring kernel itself, not the ranking endpoint's end-to-end latency; the endpoint does more work around it, so its latency is higher.

← all posts