Search systems are always broken. Mend them with care, and like Kintsugi, they become more beautiful.

Recently I’ve been diving deep into improving search & retrieval for Littlebird, and here are some notes worth sharing.

Turbopuffer

Tpuf is an amazing search system built on top of object storage. The biggest pluses of using tpuf for us were multi-tenancy, low latency, and support for both BM25 & semantic search. Their support is also amazing. We have a shared Slack channel with them, and the folks there are extremely helpful and prompt in answering all kinds of questions, including a lot of stupid ones I bug them about frequently!

On top of that, their docs & APIs are a pleasure to work with. Plus, small things like the one below are dope signals that they care about devex.

For example: Tpuf has a concept of multi-queries, where you can execute multiple queries in a single call. They also limit how many queries can run concurrently against a given namespace, and using multi-queries makes it more likely that you exceed that limit (say the limit is 16, you already have 10 in flight, and in parallel you send a multi-query with 8 queries: you have exceeded it by 2!). But they know a lot of customers use multi-queries for better performance, so they allow you to go over the limit a little :)

Small things like this make for a stickier customer!

Some tips when using Tpuf (or any system built on object storage like S3)

Namespace naming convention: Put the user ID (or your tenant ID) at the start of the namespace name. Turbopuffer is built on object storage (S3), so it gives you prefix queries on namespaces. This is a common pattern for storing user data on object storage. For example: if your namespaces are named like prod_{table_name}_{user_id}, the user ID sits at the end, so you won’t be able to search for all namespaces for a given user_id.

Being able to list all namespaces for a given ID is extremely helpful for compliance cases, for example when a user wants to delete all of their data.
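Here is a minimal sketch of the naming idea in Python. It does not call Turbopuffer at all: `namespace_name` and `namespaces_for_user` are hypothetical helpers, the `{user_id}_{table_name}_{env}` layout is just one possible convention, and the in-memory filter only stands in for whatever prefix listing your store offers.

```python
def namespace_name(user_id: str, table_name: str, env: str = "prod") -> str:
    """Build a namespace name with the user ID first, so it can act as a prefix."""
    return f"{user_id}_{table_name}_{env}"


def namespaces_for_user(all_namespaces: list[str], user_id: str) -> list[str]:
    """Stand-in for a prefix listing: keep names that start with '<user_id>_'.

    The trailing underscore matters: without it, user 'u42' would also
    match namespaces that belong to user 'u421'.
    """
    prefix = f"{user_id}_"
    return [name for name in all_namespaces if name.startswith(prefix)]


names = [
    namespace_name("u42", "notes"),
    namespace_name("u42", "events"),
    namespace_name("u421", "notes"),
]
print(namespaces_for_user(names, "u42"))  # ['u42_notes_prod', 'u42_events_prod']
```

With the user ID up front, "delete everything for this user" becomes a prefix listing followed by a delete per namespace, instead of a scan over every namespace you have.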

Schema Design: While Tpuf is amazing, it’s still evolving. It’s not as simple to update the schema willy-nilly as we are used to with an RDBMS. The docs cover what is and isn’t possible, but spending a bit more effort than usual here will help you out quite a bit in the future.

Use cache warming: Again, it’s in their docs, but cache warming can improve your latency by quite a bit, and it’s cheap and easy to do. If you are working on a system where latency is key, start warming your caches: it’ll cut your first-query latency by a lot & in turn improve your user experience!
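One way to wire this in is to warm a user’s namespace when they open the app, but not on every single event. The sketch below is an assumption-heavy illustration: `CacheWarmer` is a hypothetical helper, `warm_fn` is a placeholder for whatever warm-up call your store documents (check the Tpuf docs for the real one), and the 5-minute TTL is an arbitrary choice.

```python
import time
from typing import Callable


class CacheWarmer:
    """Warm a namespace at most once per `ttl_seconds` (hypothetical helper)."""

    def __init__(
        self,
        warm_fn: Callable[[str], None],  # placeholder for your store's warm-up call
        ttl_seconds: float = 300.0,      # arbitrary default, tune for your traffic
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._warm_fn = warm_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_warmed: dict[str, float] = {}

    def maybe_warm(self, namespace: str) -> bool:
        """Run the warm-up unless this namespace was warmed recently.

        Returns True if the warm-up call was made.
        """
        now = self._clock()
        last = self._last_warmed.get(namespace)
        if last is not None and now - last < self._ttl:
            return False
        self._warm_fn(namespace)
        self._last_warmed[namespace] = now
        return True
```

Call `maybe_warm` from a session-start or app-focus hook, ideally off the request path, so the first real query lands on a warm cache.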

Hybrid search

It’s 2025 and people are finally realizing vector search is not all it’s hyped up to be. It’s still great, and OpenAI giving easy access to embeddings made it really popular, BUT it has clear limitations in production settings. The real value of vector search shows when you combine it with high-precision lexical retrieval (such as BM25)!

Here’s a pretty popular & recent paper 'On the Theoretical Limitations of Embedding-Based Retrieval' that shows exactly this!

Here are some random points you should keep in mind, in no particular order.

RRF

When doing hybrid search, Reciprocal Rank Fusion (RRF) is the go-to choice for most people to combine results across multiple queries. While a lot has been written about RRF already, one key point to keep in mind is this: if either BM25 or vector search is performing badly, running RRF over both will pull down the good results from the one that is actually performing well!

Having a good understanding of query intent helps a lot in figuring out what weights to assign to lexical vs semantic search in RRF and in boosting the right docs!
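To make the weighting concrete, here is a minimal sketch of weighted RRF. Each doc gets `weight / (k + rank)` from every list it appears in, and the contributions are summed. The function name, the `"bm25"`/`"vector"` keys and the weights are all illustrative assumptions; `k = 60` is the commonly used constant, not something tuned for any particular system.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs with per-source weights.

    rankings maps a source name (e.g. "bm25") to doc IDs ordered best-first.
    Sources missing from `weights` default to a weight of 1.0.
    """
    scores: dict[str, float] = defaultdict(float)
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


rankings = {"bm25": ["a", "b", "c"], "vector": ["c", "d", "a"]}

# A keyword-heavy query: trust the lexical ranking more.
lexical_first = [doc for doc, _ in weighted_rrf(rankings, {"bm25": 2.0, "vector": 1.0})]
# A fuzzy, conceptual query: trust the semantic ranking more.
semantic_first = [doc for doc, _ in weighted_rrf(rankings, {"bm25": 1.0, "vector": 2.0})]
print(lexical_first)   # ['a', 'c', 'b', 'd']
print(semantic_first)  # ['c', 'a', 'd', 'b']
```

The same two result lists produce different final orders depending on the weights, which is why a noisy retriever with too much weight can drag good docs down.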

Re-Ranking using Cohere

Cohere has one of the most popular re-ranking models out there. It’s pretty good, but here are some observations you should know about, especially around latency. For some reason I couldn’t find any docs or blogs by Cohere on this topic.

Latency for structured docs is almost 2x that of unstructured: Cohere also offers re-ranking on structured docs. You need to send them in YAML format. But there is a huge price to pay in terms of latency, which is almost double that of unstructured docs.

Latency depends more on the size of docs than on the # of docs: If your re-ranking is taking a lot of time, you’ll see a bigger difference from sending smaller docs than from sending fewer docs.
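A cheap way to act on this is to trim each doc before it goes to the re-ranker. The sketch below is a hypothetical pre-processing step, not part of Cohere’s API; `shrink_for_rerank` and the 1000-character default are assumptions, and you should check against your evals that trimming doesn’t hurt relevance.

```python
def shrink_for_rerank(docs: list[str], max_chars: int = 1000) -> list[str]:
    """Trim each doc to at most max_chars, cutting at the last space when possible."""
    trimmed = []
    for doc in docs:
        if len(doc) <= max_chars:
            trimmed.append(doc)
            continue
        cut = doc[:max_chars]
        last_space = cut.rfind(" ")
        # Avoid ending mid-word; fall back to a hard cut if there is no space.
        trimmed.append(cut[:last_space] if last_space > 0 else cut)
    return trimmed
```

Keep the mapping back to the original docs by index, so the re-ranked order can be applied to the full-length versions afterwards.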

Some common (basic) issues in Agentic Search

If you are not meeting your eval scores or are seeing bad results after implementing a standard search system, AND an LLM sits at the core of your search (which is by definition agentic search), check for these first before jumping into the depths of which fusion algo makes more sense or something equally niche.

Is the LLM using the correct filters?: A very common setup is one where the LLM decides what to search for and in what date range. But LLMs are not very good at figuring out the correct date ranges. For queries like “What am I doing now”, which time range counts as “now” is pretty ambiguous. Get this wrong and no matter how good your search system is, it’ll fail. Also, don’t forget about time-zone issues.
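One option is to take the date math away from the LLM: let it pick a label like "now" or "today" and resolve that label in code. The sketch below is illustrative only. The function names and the 30-minute lookback for "now" are assumptions, and `user_tz` can be any `tzinfo` (for example a `zoneinfo.ZoneInfo`); a fixed offset is used here to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def resolve_now_window(
    now_utc: datetime, lookback: timedelta = timedelta(minutes=30)
) -> tuple[datetime, datetime]:
    """Turn 'now' into an explicit (start, end) range in UTC."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    end = now_utc.astimezone(timezone.utc)
    return end - lookback, end


def resolve_today_window(now_utc: datetime, user_tz: tzinfo) -> tuple[datetime, datetime]:
    """'Today' means the user's local calendar day, converted back to UTC."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local_now = now_utc.astimezone(user_tz)
    local_midnight = local_now.replace(hour=0, minute=0, second=0, microsecond=0)
    return local_midnight.astimezone(timezone.utc), now_utc.astimezone(timezone.utc)


# 03:00 UTC on Jan 15 is still the evening of Jan 14 for a user at UTC-8.
now = datetime(2025, 1, 15, 3, 0, tzinfo=timezone.utc)
user_tz = timezone(timedelta(hours=-8))
start, end = resolve_today_window(now, user_tz)
print(start.isoformat(), end.isoformat())
```

Filtering on the UTC calendar day here would silently drop most of the user’s actual day, which is exactly the kind of bug that looks like "search is bad".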

Query Fan-out: It’s very tempting to use LLMs for this, and by all means, LLMs should be very good at it. But look at the sub-queries the LLM is generating; there is a good chance they are the issue. Explaining properly to the LLM what these queries are going to be used for (sub-queries for BM25 & vector shouldn’t look the same) will go a long way!
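A quick way to look at the sub-queries systematically is to have the LLM return them per retriever and run a small lint over the result. This is a rough sketch with made-up names (`FanOut`, `lint_fan_out`) and an arbitrary threshold of 6 terms for "this BM25 query reads like a sentence"; treat it as a debugging aid, not a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FanOut:
    bm25_queries: list[str]    # short keyword queries for lexical retrieval
    vector_queries: list[str]  # fuller natural-language queries for embeddings


def lint_fan_out(fan_out: FanOut, max_bm25_terms: int = 6) -> list[str]:
    """Return human-readable warnings about suspicious sub-queries."""
    problems = []
    normalized_vector = {q.strip().lower() for q in fan_out.vector_queries}
    for query in fan_out.bm25_queries:
        if query.strip().lower() in normalized_vector:
            problems.append(f"same query sent to BM25 and vector: {query!r}")
        if len(query.split()) > max_bm25_terms:
            problems.append(f"BM25 query reads like a sentence: {query!r}")
    return problems


fan_out = FanOut(
    bm25_queries=[
        "turbopuffer cache warming",
        "how do I reduce the latency of my first query",
    ],
    vector_queries=[
        "How do I reduce the latency of my first query",
        "ways to make the first search request faster",
    ],
)
for problem in lint_fan_out(fan_out):
    print(problem)
```

Logging these warnings next to your eval runs makes it easier to tell whether a bad result came from retrieval itself or from the queries it was given.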

These two are usually among the first steps in any agentic search, and messing them up means all the search down the road isn’t going to give you the results you want.

Evals for search

I just wanted to add this section as a reminder not to forget about evals amid all this. Here is a great blog on it, and this part sums up a lot for me.
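If you don’t have any search evals yet, two simple metrics over a small labeled set are a reasonable place to start. This is a generic sketch, not taken from the blog mentioned above; the function names and the toy doc IDs are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that show up in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```

Averaging these over a few dozen real queries, before and after each change, is usually enough to tell whether a tweak to weights, filters or fan-out actually helped.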

PS: I know this blog really isn’t well structured, but I really wanted to get back to writing again, and rather than doing a deep dive into each topic in a single blog, I’ve started with things off the top of my head. Will go into much more depth and practical points in the upcoming blogs :)

Subscribe here or follow me on X to get notified or have more discussions on similar topics!

Some notes on Agentic search & Turbopuffer
