Here’s how to use one SQL query to get the top K vector matches in each category.
At SingleStore, our position on vector database processing is that you should have all the benefits of a modern, full-featured DBMS available to you when working with vectors [Han23]. This includes full SQL support.
We're working with a prospective user of SingleStoreDB for vector database operations related to an application with a semantic search component. He asked how we could do the following easily: find the top K items in each category. This is something he didn't have an easy time with in Milvus, a specialty vector database.
With Milvus, he was finding the categories in one query, looping through them and finding the top K elements for one category at a time with a separate query. This is not easily parallelizable, and requires more work from the application side than many would prefer.
Here's how you can do this in a single SQL query in SingleStoreDB:
/* Make some items in multiple categories, with associated
vector embeddings. */
create table items(id int, category varchar(50), vector blob);
insert into items values
(1, "food", json_array_pack('[0,0,0,1]')),
(2, "food", json_array_pack('[0,0.5,0.3,0.05]')),
(3, "food", json_array_pack('[0,0.5,0.2,0]')),
(4, "facilities", json_array_pack('[0,0,1,0]')),
(5, "facilities", json_array_pack('[0,0.6,0.1,0.05]')),
(6, "facilities", json_array_pack('[0,0.4,0.3,0]'));
-- query vector
set @qv = json_array_pack('[0,0.4,0.3,0]');
-- get top 2 in each category using ranking
with scored as(
select id, category, dot_product(vector, @qv) as score
from items
),
ranked as (
select
row_number() over(partition by category order by score desc)
as rank, *
from scored
)
select *
from ranked
where rank <= 2
order by category, rank;
These are the results:
+------+------+------------+---------------------+
| rank | id | category | score |
+------+------+------------+---------------------+
| 1 | 4 | facilities | 0.30000001192092896 |
| 2 | 5 | facilities | 0.27000001072883606 |
| 1 | 2 | food | 0.2900000214576721 |
| 2 | 3 | food | 0.25999999046325684 |
+------+------+------------+---------------------+
It’s important to note we're just focusing on ease of expression here, not performance. For this particular application, the scope is usually a few million vectors at most — so a full-scan, exact-nearest-neighbor approach is plenty fast.
When you choose a tool for vector processing for nearest-neighbor search applications like semantic search, chatbots, other LLM applications, face matching, object matching and more, we think it's a good idea to consider the power of the query language — and having full SQL available just makes things easier.
References
[Han23] E. Hanson and A. Comet, Why Your Vector Database Should Not be a Vector Database, SingleStoreDB blog, April 24, 2023.
Explore more vector database-related resources