Vector Search on the KV260: Multi-Query Batching, and the Hidden 2× I Almost Missed

This is another "here's what I built, here's where it broke, here's what I learned" post. Same board as before (the AMD Kria KV260), different problem — vector retrieval instead of LLM inference — but the same theme keeps coming up: memory bandwidth is almost always the real ceiling, and the bug that hurts the most is the one you don't notice because the number looks reasonable.


The Setup: DocSear

I wanted to build a semantic search engine over a personal library of books. Type a query like "beautiful writing about river and water" and get back the chunks (~500 chars each) that actually mean that — not just chunks that share keywords.

The standard recipe: embed every chunk into a fixed-dimension vector, embed the query into the same space, return the top-10 chunks by cosine similarity. For a corpus of 10,000 books at ~400 chunks each, that's ~4 million vectors. At 512 dimensions × INT8 each, the chunk database is ~2 GB.

The interesting question: where does the math actually happen?

Most people reach for an approximate nearest-neighbor index — HNSW, IVF-PQ, that whole family. They're fast but lossy: pgvector with HNSW (m=16, ef_search=40) runs my workload at ~6 ms with 90 % recall. That's totally reasonable for production.

But brute force is correct. Every query, score every chunk, return the actual top-10. The arithmetic is embarrassingly parallel and uniform — exactly the kind of workload an FPGA was made for. I had a KV260 already sitting on the desk from the BitNet experiment, and I wanted to see how far I could push it.

The headline result, before I get into how:

Backend Per-query latency Recall@10 vs. pgvector
pgvector fp32 seqscan 604.5 ms 100 %
FPGA (single query) 4.12 ms 100 % 147×
FPGA (8-query batch) 0.537 ms eff. 100 % 1,125×

Same KV260, same INT8 vectors, exact brute force on both sides. The accelerator does what pgvector does, just a thousand times faster.

The rest of the post is how I got from the 147× number to the 1,125× number, and the wrong turns along the way.


The Basic FPGA Pipeline (K=1)

The pipeline at K=1 — single query at a time — has three HLS-generated IPs cooperating on the FPGA fabric:

DDR (chunks)  ──HP0──►  axi_burst  ──AXIS──►  mac_array  ──K AXIS──►  topk_sorter
   13 MB                stream chunks         dot-product            keep top-10
   25,587               512 B at a time       int8 → int32           per query
   vectors
                                                 ▲
                                            query (128 B/lane × 32)
                                            held in BRAM

In rough pseudocode, what mac_array does each clock cycle:

loop over 25,587 vectors:                # one chunk = 512 INT8 bytes
    loop 32 times (one 128-bit beat each):
        beat = axi_burst.read()          # 16 chunk bytes this cycle
        for lane in 0..15:               # 16 parallel int8 macs
            acc += chunk_byte[lane] * query[d + lane]    # held in BRAM
    score_out.write({acc, vec_index})    # one score per chunk
    acc = 0

topk_sorter is just a 10-slot running insertion sort: every incoming score either pushes into the top-10 (with the smallest existing slot evicted) or gets dropped. By the time the last vector flows through, the top-10 register file holds the answer, and the host reads it back via AXI-Lite.

The whole thing runs at 200 MHz. Each chunk takes 32 cycles to score = 160 ns. Times 25,587 = 4.1 ms of compute. The actual measured number was 4.12 ms — within 1 % of the theoretical bound. The pipeline is fundamentally DDR-bandwidth-bound: we're moving 13 MB of chunk data through one HP port per query, and that's the entire critical path.

Doubling the ports (HP0 + HP1, each pulling half the database in parallel) actually doubled throughput cleanly to 4.12 ms from the original ~8 ms single-port version. That was the easy win.

But after dual-port, there's no third port to add. The FPGA fabric is mostly idle — 1 DSP out of 1,248 is doing useful work per cycle, the other 1,247 are sitting there. The bottleneck is unambiguously bytes-per-second from DDR, not ops-per-second from the fabric.


The Batching Idea: Reuse Each Byte K Times

Here's the trick that unlocks the next jump. When you process one query, each chunk byte arrives from DDR and gets used exactly once — multiplied by one query byte, then it's gone. The DDR bandwidth is "expensive" (a precious 3 GB/s), but the FPGA compute is "free" (we have 1,000+ idle DSPs).

What if every chunk byte was used K times instead of once?

                                                                ┌─► mac (query 0)  ─► topk
                                                                ├─► mac (query 1)  ─► topk
chunk byte from DDR  ──►  broadcast wire to K query lanes  ──┤  ├─► mac (query 2)  ─► topk
                                                                ├─► ...
                                                                └─► mac (query K-1) ─► topk

Same DDR read, K times the compute per byte. Throughput goes up linearly in K, for free, until either the fabric runs out of multipliers or the build tools run out of memory. (Spoiler: the second one bites first.)

In HLS, the inner loop becomes:

// pre-load K=8 queries into local registers (only at the start of a batch)
load_queries_from_ddr_via_hp2();

loop over chunks:
    loop 32 times (one 128-bit beat each):
        beat = axi_burst.read();
        for lane in 0..15:                        # 16 byte lanes per beat
            #pragma HLS UNROLL
            for k in 0..K-1:                      # K parallel accumulators
                #pragma HLS UNROLL
                acc[k] += beat[lane] * query[k][d + lane];
    for k in 0..K-1:
        score_out[k].write({acc[k], vec_index});  # K scores per chunk
        acc[k] = 0;

The pragmas matter: UNROLL tells the synthesizer to materialize all K × 16 = 128 INT8 multipliers at once. Pair that with the SIMD packing trick (two INT8 macs per DSP48E2), and you get 256 useful ops per cycle per mac_array instance — for the same DDR rate.

This is what the LLM world calls B-reuse in GEMM, or "batch-reuse along the input dimension." On a GPU it's automatic; on an FPGA you wire it in yourself.

I locked all the interface decisions down in a design doc before touching code, picked K=16 as the target (~512 DSPs = 41 % of the KV260, well within budget), and started implementing.


The First Wall: Build-Host RAM, Not Chip Resources

K=16 csim passed on the first try — the algorithm was right. The chip had room — 41 % DSP, ~56 % LUT, plenty of BRAM.

Then I tried to synthesize.

ERROR: process killed: out of memory
  vivado synth_1 peaked at ~5.5 GB during Cross Boundary Optimization

The problem wasn't the chip. The problem was the build machine — my home server has 6 GB of RAM total, and Vivado's synth pass for the K=16 topk_sorter was generating 2,050 × 32-bit multiplexers (a K-way parallel insertion sort against 10 sorted slots is a lot of comparators), and the area-optimization pass needed more memory than I had.

I tried K=8. It got further — past Timing Optimization, peaked at 4.4 GB — and then got OOM-killed during Technology Mapping. Even with -jobs 1 to disable parallelism.

K=4 finally built (~45 min). I deployed it, ran the benchmark, got correct top-10 results for all 8 queries in the warmup batch...

per-batch median  :  8.21 ms    (4 queries / batch)
effective per-query: 2.05 ms
sustained QPS      : 487
recall @ 10        : 100 %

A clean 2× speedup over K=1 (4.12 → 2.05 ms). Numbers match my predictions about half-way: I'd modeled K=4 as ~1 ms/query, not 2.05.

This is exactly the kind of result that's dangerous. It looks fine. 2× is a real improvement. The instinct is to commit it, write up the journey, move on. I almost did.

What kept me from doing that was a habit I've been trying to build: before celebrating a number, derive what the number should be from first principles, and only celebrate if it matches. K=4 should have been ~1 ms, not 2.05.

There was a 2× hiding somewhere.


The Hidden 2×: A Two-Phase Topk That Looked Innocent

The design has two compute pipes (HP0 and HP1), each reading half the database. With #pragma HLS DATAFLOW they're supposed to run in parallel. The topk_sorter was structured like this:

// phase 0: drain mac_0's score streams
for v in 0..n_pipe0:
    #pragma HLS PIPELINE II=1
    for k in 0..K-1:                # K parallel insertion sorts
        score = score_in_0[k].read()
        insert_into_slots(slots[k], score)

// phase 1: drain mac_1's score streams
for v in 0..n_pipe1:
    #pragma HLS PIPELINE II=1
    for k in 0..K-1:
        score = score_in_1[k].read()
        insert_into_slots(slots[k], score)

Looks reasonable, right? Each phase pipelines at II=1, eats from its own pipe's streams, updates the same slots[k] array. Total work = roughly 2 × (N/2) cycles = N cycles = 25,587 cycles = 0.13 ms. Plenty fast.

Here's what I'd missed: HLS DATAFLOW parallelises functions, not loops within a function. The two for loops above compile to sequential FSM states. While phase 0 is running, the topk literally isn't reading from mac_array_1's score outputs.

And here's the consequence: mac_array_1 produces a score every 32 cycles per stream, but its output FIFO is 2 entries deep by default. So after two beats — about 320 nanoseconds — mac_1's output FIFOs are full, and the entire mac_1 pipeline stalls on backpressure. For the full 2.05 ms that phase 0 takes.

Only after mac_0 finishes does topk move to phase 1 and start draining mac_1. Now mac_1 finally gets to run — for another 2.05 ms. Total wall-clock: ~4.1 ms of useful work serialised into ~8.2 ms.

The two compute pipes — which Vivado's block diagram clearly shows as parallel — were actually running sequentially because of a single sequential boundary in the downstream IP. The math worked out perfectly. The throughput model failed by 2×.

What's worse: this 2× was hiding in the K=1 dual-HP-port branch too. The "4.12 ms" baseline that everyone was anchored to was already 2× slower than the bandwidth model said it should be — but the model was never written down anywhere, so nobody noticed.

This is the bug I'll remember from this project. Not because it was hard to find (once I sat down to derive the expected number, it was obvious in ten minutes), but because the entire dev loop had been operating on numbers that "looked right" without ever pinning them to a theoretical ceiling.


The Fix: Two Sorters, One PS-Side Merge

Now, three options to recover the 2×:

Option Approach Topk LUTs Synth memory Risk
A — unified-phase One loop reading both pipes per cycle (2K parallel insertions) ~5–6 GB might OOM the build host again
B — per-pipe topk Two identical sorters, each owning one pipe; PS merges the two top-10 lists each 0.5× each ~2.5 GB safe
C — drop topk from PL Mac writes raw scores to DDR, PS sorts 0 smallest most code change, ~1.7× recovery only

I went with B. The 10+10→10 merge in software is trivial (~100 µs for a batch of K queries, dominated by AXI-Lite read latency, not the sort itself). Each topk shrinks to ~half the previous size because it only reads K streams instead of 2K, runs only one drain loop instead of two. The single sequential boundary that caused the stall is gone — split into two completely independent IPs that run truly in parallel.

The architectural change is small but consequential:

Before (one monolithic topk, sequential phases):

  mac_array_0 ──┐
                ├──► topk_sorter (drains pipe-0, THEN pipe-1)
  mac_array_1 ──┘                       ↑
                                   stalls mac_1


After (two identical sorters, each owning one pipe):

  mac_array_0 ──► topk_sorter_0 (drains pipe-0 only) ──► AXI-Lite
                                                              ╲
                                                          PS does
                                                       10 + 10 → 10
                                                              ╱
  mac_array_1 ──► topk_sorter_1 (drains pipe-1 only) ──► AXI-Lite

In HLS, the new sorter is half the code:

void topk_sorter(score_in[K], topk_out[K][10], n_vectors):
    for v in 0..n_vectors:
        #pragma HLS PIPELINE II=1
        for k in 0..K-1:                  # K parallel insertion sorts
            #pragma HLS UNROLL
            score = score_in[k].read()
            insert_into_slots(slots[k], score)
    write_results_to_axilite(topk_out, slots)

One loop, one input set, no sequential boundary. The two instances synthesise sequentially with -jobs 1 (each peaking at ~2.5 GB), well within the home server's 6 GB. And because each is smaller now, K=8 finally fit where the monolithic K=8 hadn't.

Total LUT usage with two K=8 sorters: 64 % of the chip. Still plenty of room.


A Bring-Up Bug Worth Mentioning

First post-deploy run, K=8 firmware loaded, kicked the pipeline — and got back top-10 results that were all (0, 0) for most queries, except query slots 6 and 7 contained the correct answers for queries 0 and 1.

Classic "computing the right thing, writing to the wrong place" symptom. Took about ten minutes to track down.

The cause: Vitis HLS auto-generates the AXI-Lite address map based on how wide a decoder the register file needs. At K=4, topk_out[40] fit at base address 0x200. At K=8, topk_out[80] needed wider decoding, so HLS quietly shifted the base to 0x400. My Python driver had 0x200 hardcoded from the K=4 build, and was reading from a partially-mirrored undefined region.

The fix was a one-line constant change. The lesson is bigger: never hardcode HLS-generated addresses across rebuilds. Always re-grep *_hw.h after every HLS run.


The Final Numbers

After all of that:

Backend Algorithm Per-query latency Throughput Recall@10 vs. pgvector
pgvector fp32 seqscan brute force 604.5 ms 1.7 QPS 100 %
pgvector fp32 + HNSW (ef_search=40) approximate 5.89 ms 170 QPS 90 % 103×
FPGA K=1 (dual HP port) brute force 4.12 ms 243 QPS 100 % 147×
FPGA K=4 (monolithic topk) brute force 2.05 ms eff. 487 QPS 100 % 295×
FPGA K=8 (dual topk + PS merge) brute force 0.537 ms eff. 1,864 QPS 100 % 1,125×

The cumulative speedup chain in latency terms:

numpy int32 brute force (CPU)    62.3   ms      (1×       baseline)
FPGA single port                  8.21  ms      (7.6×)
FPGA dual HP port (K=1)           4.12  ms      (15.1×)
FPGA K=4 batched                  2.05  ms eff. (30.4×)
FPGA K=8 dual topk                0.537 ms eff. (116×)

Same KV260, same INT8 vectors, exact retrieval throughout. The 116× over numpy is the apples-to-apples comparison; the 1,125× over pgvector includes the framework overhead pgvector pays for being a general-purpose database engine.

The chip is at 64 % LUT and 4 % DSP. K=16 should theoretically fit and would deliver around 3,700 QPS — but the build host's 6 GB RAM is the gating factor, not the silicon. A 32 GB threadripper-class machine would probably synthesise K=16 in one shot.


What I'd Like You to Take Away

The headline number is fun, but the more important things are smaller.

Derive the expected number before you celebrate the measured one. I shipped K=4 at 2.05 ms/query thinking it was a clean 2× over K=1. The number was right — it just didn't match what the bandwidth model said it should be, and I hadn't bothered to derive that number explicitly. If I had, I would have caught the phase-serialisation bug three months earlier in the project.

#pragma HLS DATAFLOW parallelises functions, not loops within a function. This is the kind of subtle thing the docs technically say but you don't feel until it bites you. Any time you see two sequential for loops in an IP and assume they pipeline, double-check.

On a small build host, your design ceiling is not the chip's resources — it's Vivado's synthesis memory peak. The K=16 monolith would have run beautifully on the KV260. It just couldn't be built. Splitting one big sorter into two smaller identical ones halved the per-IP synthesis complexity and unlocked K=8 that had previously OOMed. Sometimes the right architectural move is the one that fits your tools, not your silicon.

Move work to the side of the boundary where it's cheap. The PS does a K × (10+10 → 10) merge in software in ~100 µs. Trying to do that in PL would have needed a third IP, more block-design work, more DTBO entries, and another synth pass. The KV260 has a quad-core A53 sitting idle next to the FPGA — let it do the small, irregular things, and keep the FPGA doing the wide, regular things.


I keep coming back to the KV260 because it sits right at the threshold where you can actually finish things on a hobbyist time budget. The compile cycle isn't 18 hours like a serious datacenter card. The chip has enough fabric to do real workloads but few enough resources that you have to think carefully. And every time I push on it, I learn something — not just about FPGA design, but about the way I'm thinking about a problem.

This one taught me that the bug worth fixing is the one that looks like an honest 2× improvement.