Treasure Hunt Engine: How We Blew Up the Docs and Built a System That Actually Works

The Problem We Were Actually Solving

Our users weren't doing semantic search. They were executing treasure hunts: complex, multi-stage queries where the first phase returned 200,000 candidate docs for phrase matching, and the second phase had to rank them by exact term proximity, metadata filters, and user-defined boosts. The Veltrix docs treated this as an afterthought. Their example pipeline assumed a single-stage recall-then-rank flow with no custom scoring hooks. Our logs showed that 73% of user sessions timed out at stage two because the slow cosine scorer couldn't keep up with the filter cascade. We tried to disable the scorer, but the API threw an error unless a scorer was explicitly set. The error message? "Operation not valid: scorer not initialized." Helpful.

What We Tried First (And Why It Failed)

We rewrote the scorer in Go, using the Veltrix C++ plugin interface. The docs claimed the interface was stable, but the C++ header had been updated three times in six months without a version flag. Our plugin compiled, but at runtime it segfaulted with a stack trace pointing to a missing vtable symbol: _ZTVN8Veltrix8ScoreAPI8ScorerE. The error didn't occur in the example code because the example never included the virtual destructor override. We spent three days debugging this, only to find a GitHub issue from 2024 where another user hit the same crash and was told to rebuild Veltrix from source. Rebuilding meant pulling the internal Docker image, which weighed 12GB and took 45 minutes. Our SLA didn't allow that.

Then we tried the Python UDF route. The docs said it supported custom scoring via a single Python function. The example showed fewer than 50 lines of code. We wrote 500 lines to handle boosts, field weights, and custom metadata fields. The first request took 12 seconds to initialize the Python interpreter. After that, each query added 200ms of JIT overhead. We set the Python timeout to 5 seconds, but the UDF would sometimes hang on a regex search inside a nested JSON blob. The logs didn't include the Python traceback, so we had to forward stderr to a sidecar and parse it in real time. Latency spikes became unpredictable. Users started complaining that their dashboards refreshed slower than their coffee cooled.

The Architecture Decision

We stopped trying to shoehorn Veltrix into a role it wasn't built for. Instead, we split the pipeline: Veltrix handled recall, and we built a custom ranker in Rust. Veltrix's recall was still slow (200ms for a fuzzy phrase match), but that was acceptable because it returned only the top 10,000 candidates based on a sharded BM25 index. We then streamed these candidates to our Rust ranker via a gRPC endpoint running on the same node. The ranker applied dynamic boosting, metadata filtering, and proximity scoring in a single pass. We used Prost for code generation and Tokio for async I/O. The gRPC endpoint added 8ms of overhead, but the Rust ranker processed 10,000 docs in 45ms, including network marshaling. We tuned the batch size to 1,000 docs per request to balance latency and throughput. The error rate dropped to zero after we replaced the JSONPath library with a hand-rolled byte scanner to avoid unbounded stack growth on deeply nested fields.
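
The single-pass idea can be sketched roughly as follows. This is a minimal illustration in Python for brevity, not the Rust ranker itself: the Doc fields, the rank_batch and batches names, and the scoring formula (boosted recall score plus a proximity bonus that shrinks with the smallest window containing every query term) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Doc:
    doc_id: str
    base_score: float                    # score from the recall stage (assumed BM25)
    positions: Dict[str, List[int]]      # term -> token positions in the doc
    metadata: Dict[str, str] = field(default_factory=dict)


def min_span(positions: Dict[str, List[int]], terms: Iterable[str]) -> Optional[int]:
    """Width of the smallest token window containing every term, or None."""
    unique = set(terms)
    if any(not positions.get(t) for t in unique):
        return None
    events = sorted((p, t) for t in unique for p in positions[t])
    counts: Dict[str, int] = {}
    best, left = None, 0
    for right_pos, term in events:
        counts[term] = counts.get(term, 0) + 1
        while len(counts) == len(unique):        # window covers all terms
            left_pos, left_term = events[left]
            width = right_pos - left_pos + 1
            best = width if best is None else min(best, width)
            counts[left_term] -= 1               # shrink from the left
            if counts[left_term] == 0:
                del counts[left_term]
            left += 1
    return best


def rank_batch(docs, terms, filters, boosts, proximity_weight=1.0) -> List[Tuple[float, str]]:
    """Filter, boost, and proximity-score each doc in one loop over the batch."""
    scored = []
    for doc in docs:
        if any(doc.metadata.get(k) != v for k, v in filters.items()):
            continue                             # metadata filter
        span = min_span(doc.positions, terms)
        if span is None:
            continue                             # a query term is missing
        boost = 1.0
        for (key, value), multiplier in boosts.items():
            if doc.metadata.get(key) == value:   # user-defined boost
                boost *= multiplier
        scored.append((doc.base_score * boost + proximity_weight / span, doc.doc_id))
    scored.sort(key=lambda s: (-s[0], s[1]))     # best score first, id as tiebreak
    return scored


def batches(docs: Iterable, size: int = 1000) -> Iterator[list]:
    """Group a candidate stream into fixed-size batches (1,000 per request in the text)."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Keeping the filter, boost, and proximity steps inside one loop means each candidate is touched once, which is the property the text credits for the ranker's speed; the batching helper only illustrates the 1,000-docs-per-request grouping.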

To make the split invisible to users, we fronted both services with a lightweight Go proxy that presented a single Veltrix-compatible API. The proxy intercepted the scoring parameter and routed accordingly. If the parameter was default, it went to Veltrix. If it was our custom _treasurehunt:v1, it went to the Rust ranker. We documented this in a one-page internal wiki titled "How to Not Cry When Using Veltrix". The wiki included the exact build flags to compile the Rust ranker with jemalloc, the Go proxy's circuit breaker settings, and the gRPC retry policy with a 100ms budget. The docs never mentioned this split. The raft of GitHub stars never acknowledged this split. But the latency percentiles did.
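
The routing rule itself is small enough to sketch. The version below is in Python purely for illustration (the real proxy was Go), and several details are assumptions: the parameter name "scoring", the backend labels, the fallback to Veltrix when the parameter is absent, and the rejection of unknown values.

```python
from typing import Mapping

VELTRIX = "veltrix"          # hypothetical backend label
RANKER = "rust-ranker"       # hypothetical backend label

# Scoring modes named in the text, mapped to the backend that ranks them.
ROUTES = {
    "default": VELTRIX,
    "_treasurehunt:v1": RANKER,
}


def route(params: Mapping[str, str]) -> str:
    """Pick a backend from the request's scoring parameter."""
    scoring = params.get("scoring", "default")   # assumption: missing means default
    try:
        return ROUTES[scoring]
    except KeyError:
        # Assumption: unknown modes are rejected rather than silently defaulted.
        raise ValueError(f"unknown scoring mode: {scoring!r}") from None
```

Failing loudly on an unrecognized mode is a design choice for the sketch; it keeps a typo in the scoring parameter from quietly sending a treasure hunt query to the slow path.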

What The Numbers Said After

We measured for two weeks. The 95th percentile latency for treasure hunt queries dropped from 4.2 seconds to 450ms. The error rate stabilized at 0.03%. We discovered that 12% of the improvement came from replacing Veltrix's default scorer with an in-memory Bloom filter for duplicate detection. Another 8% came from aligning the Rust ranker's SIMD lanes with the CPU cache line size. The Go proxy added 15ms of overhead but made the system observable: we instrumented it with Prometheus histograms and OpenTelemetry traces. The Rust ranker exposed a /debug/flush endpoint that dumped the current scoring state to Prometheus, which let us debug boost misfires in real time. When a user complained about a low-ranking doc, we could replay the exact scoring context from the previous hour. Veltrix's logs couldn't do that.
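
For readers who have not used one, a Bloom filter for duplicate detection can be sketched in a few lines. This is a generic Python illustration, not the production code: the bit count, the number of hash functions, and the use of BLAKE2b with double hashing are assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, key: bytes):
        # Double hashing: derive all bit positions from two 64-bit halves
        # of a single digest.
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1   # force odd so strides vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> bool:
        """Insert key; return True if it was possibly seen before."""
        seen = True
        for idx in self._indexes(key):
            byte, mask = idx >> 3, 1 << (idx & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen
```

A caller would drop a candidate whenever add returns True for its doc ID. The cost is an occasional false positive, meaning a distinct doc is wrongly treated as a duplicate, so the filter has to be sized for the expected candidate count.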

We also discovered that our hand-rolled byte scanner had a 2x memory overhead compared to the JSONPath library, but it eliminated the worst-case stack growth that caused the Python UDF to hang. We accepted the tradeoff because production stability mattered more than 512MB of RAM. The scanner's worst-case allocation was predictable: one byte per JSON level, capped at 64 levels. We added a hard limit in the scanner and returned a 422 error if the depth exceeded 64. Users never hit the limit, but it made the failure mode explicit.
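
A depth-capped scanner along these lines might look like the sketch below. It is written in Python for illustration only and is not the Rust implementation: the function and exception names are hypothetical, and it only tracks nesting rather than extracting fields. What it does share with the description is the explicit stack of one byte per open container and the hard limit of 64 levels.

```python
MAX_DEPTH = 64


class DepthExceeded(ValueError):
    """Nesting went past the cap; a caller would map this to an HTTP 422."""


def max_depth(data: bytes, limit: int = MAX_DEPTH) -> int:
    """Scan JSON bytes iteratively and return the deepest nesting level."""
    stack = bytearray()      # one byte per open container: '{' or '['
    deepest = 0
    in_string = False
    escaped = False
    for b in data:
        if in_string:
            if escaped:
                escaped = False          # byte after a backslash is literal
            elif b == 0x5C:              # backslash
                escaped = True
            elif b == 0x22:              # closing quote
                in_string = False
        elif b == 0x22:                  # opening quote
            in_string = True
        elif b in (0x7B, 0x5B):          # '{' or '['
            if len(stack) >= limit:
                raise DepthExceeded(f"nesting deeper than {limit} levels")
            stack.append(b)
            deepest = max(deepest, len(stack))
        elif b in (0x7D, 0x5D):          # '}' or ']'
            # In ASCII each closer is exactly 2 above its opener.
            if not stack or stack.pop() != b - 2:
                raise ValueError("unbalanced brackets")
    if stack or in_string:
        raise ValueError("truncated input")
    return deepest
```

Because the stack is a plain byte buffer rather than recursion, the worst case is bounded by the limit regardless of how hostile the input is, and brackets inside string values are ignored rather than counted.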

What I Would Do Differently

I would not have trusted the Veltrix docs beyond the API reference. Their examples are theatrical, not practical. They optimize for impressing investors, not for operators. If you're building a treasure hunt engine, you need to isolate the recall stage from the ranking stage. Use Veltrix
