The Scalable SQL, Full-Text and Vector Platform for Gen AI

7 min read

Jul 10, 2024

Until now, database developers building an app with search requirements had tough choices to make.

They could choose several tools: one for full-text search, one for vector search and a SQL or NoSQL system for their main application data, then move data between them. But that involves extra complexity, more license fees and multiple learning curves.

Or they could choose one system, say a SQL or NoSQL system with basic vector and text search capabilities but some significant gaps, and just live with those gaps. That could lead to sleepless nights worrying about losing business to a competing application that provides a better search and gen AI experience.


We are proud to announce that with our new SingleStore 8.7 release, you don't have to compromise anymore. We have completely re-implemented our full-text search engine based on the widely used and respected Apache Lucene (JLucene) platform. This is the same open-source engine inside the Elasticsearch and Solr full-text search systems.

We also delivered high-performance vector search via approximate-nearest-neighbor (ANN) indexing in January of this year, and have enriched it in 8.7 with filtered ANN support and a new vector range search feature.

Full-text enhancements

We introduced full-text search six years ago. Our original implementation was based on CLucene, with the CLucene engine compiled into the SingleStore service process. A lot has changed since then. Fuzzy search, proximity search, boosting, BM25 scoring and JSON search are required for many new full-text search apps — and we support all of these in SingleStore 8.7.

Our new full-text search implementation runs JLucene in a co-process next to the SingleStore service process on each node, communicating with our engine via shared memory and Unix domain sockets. All the data is stored inside the SingleStore database. By using shared memory, we minimize data copying to achieve excellent performance. And our engine is fully distributed: we have a partitioned database model where each node contains one or more database partitions (also known as shards), and our JLucene indexes are local to a single database partition.

Here's a diagram of the node architecture of SingleStore. The memsqld process is the SingleStore database service process, implemented in C++. The name dates back to our prior company name, MemSQL.

Full-text search on JSON

SingleStore's JSON data type has gotten more and more popular since it was introduced, and is now used by over 95% of our cloud customers. People love mixing semi-structured data in with their structured data. They like getting the benefits of the relational model and SQL (for flexible query processing and data independence) as well as JSON in the same system — benefits you can't fully enjoy with a NoSQL system.

Until the 8.7 release, you could not do full-text search over JSON fields. You could create a persisted computed TEXT-type column and index that, but you had to store the data twice. Now, you can put full-text indexes on JSON columns directly.
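
For comparison, the pre-8.7 workaround looked roughly like this (a sketch, with a hypothetical table name and computed column): the author text is extracted from the JSON into a persisted computed column and stored a second time, just so it can be indexed.

CREATE TABLE t_old (
  id INT UNSIGNED,
  records JSON,
  author AS records::$author PERSISTED TEXT,
  SORT KEY(id),
  FULLTEXT (author)
);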

Here's an example of creating a full-text index over text and JSON data:

CREATE TABLE t (
  id INT UNSIGNED,
  title VARCHAR(200),
  records JSON,
  SORT KEY(id),
  FULLTEXT USING VERSION 2 (title, records)
);

The USING VERSION 2 notation indicates that a JLucene full-text index should be created. The original version will continue to be supported but is deprecated, and new development should use version 2.

Let's demonstrate full-text search on JSON and fuzzy search in one example.

insert t values(1,"the tempest",'{"author":"Shakespeare"}');
insert t values(2,"the grapes of wrath",'{"author":"Steinbeck"}');
select * from t where match(table t)
against("records$author:steenbeck~");
+------+---------------------+------------------------+
| id   | title               | records                |
+------+---------------------+------------------------+
|    2 | the grapes of wrath | {"author":"Steinbeck"} |
+------+---------------------+------------------------+

In SingleStore full-text search version 2, you use the MATCH() AGAINST() notation to specify a full-text filter or expression. The MATCH() clause takes TABLE <tablename> as an argument, and uses the one full-text index on that table. The AGAINST() clause takes a JLucene query string as an argument and passes it to the JLucene engine. JSON documents are indexed such that JSON property names can be referenced and searched explicitly and separately; in this case, we're searching within the "author" property. The tilde (~) character at the end of "steenbeck" requests a fuzzy search based on Levenshtein distance, with a default maximum edit distance of two.
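
Lucene's query syntax also lets you cap the maximum edit distance explicitly by putting a number after the tilde. For instance, this variation of the query above (a sketch) allows only one edit; it should still return row 2, because "steenbeck" is only a single edit away from "steinbeck":

select * from t where match(table t)
against("records$author:steenbeck~1");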

Proximity search allows you to search for words that appear within a specified number of words from each other. For example:

create table s(id int, t text, fulltext using version 2(t));
insert s values(1, "red green blue");
insert s values(2, "red yellow here is some text purple");
select * from s where match(table s) against('t:"yellow purple"~4');
+------+-------------------------------------+
| id   | t                                   |
+------+-------------------------------------+
|    2 | red yellow here is some text purple |
+------+-------------------------------------+

1 row in set (0.16 sec)

select * from s where match(table s) against('t:"yellow purple"~3');

Empty set (0.01 sec)

BM25 scoring

The BM in BM25 stands for "Best Matching." This scoring method is superior to the term frequency/inverse document frequency (TF-IDF) method, in part because it does a better job of handling document length and term saturation. Good handling of term saturation means that as a term appears more and more times in a document, each additional occurrence contributes less and less to the final search weighting; the document becomes "saturated" with respect to that term.

For example, suppose the term "cheetah" appears in one document 200 times and in a second document 100 times. Is the first document really twice as good at satisfying an information need about cheetahs? Clearly not. BM25 gives an intuitive result for these kinds of term-frequency situations. SingleStore returns a per-segment BM25 score for MATCH() AGAINST() in version 8.7.
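
You can surface that score directly in the select list with MATCH() AGAINST(). Here's a sketch using the small table s from the proximity example (the exact score values depend on your data and segments):

select id, t, match(table s) against('t:red') as score
from s
order by score desc;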

Boosting

Boosting of terms is supported with our new full-text search engine. You can specify a weight or boosting factor with some — or all — of the terms in your query. This allows you to rank documents with favored terms higher.

For example, this query boosts red by two and yellow by four:

select *, match(table s) against('t:red^2 t:yellow^4') as score from s;
+------+-------------------------------------+--------------------+
| id   | t                                   | score              |
+------+-------------------------------------+--------------------+
|    2 | red yellow here is some text purple | 0.7845874428749084 |
|    1 | red green blue                      | 0.2615291476249695 |
+------+-------------------------------------+--------------------+

This query is the same, but with no boosting:

select *, match(table s) against('t:red t:yellow') as score from s;
+------+-------------------------------------+---------------------+
| id   | t                                   | score               |
+------+-------------------------------------+---------------------+
|    2 | red yellow here is some text purple |  0.2615291476249695 |
|    1 | red green blue                      | 0.13076457381248474 |
+------+-------------------------------------+---------------------+

You can see that the gap between the scores for the rows with id values 2 and 1 is larger in the boosted version.

Vector search enhancements

We improved indexed ANN search in 8.7 by adding support for vector range search. Vector range search is a powerful tool to help you avoid some pitfalls of indexed ANN search for the "top K" closest values. For example, suppose you want to find all the items whose vectors are "really close" to your query vector.

This is hard to do with "top K" search because there might be 100 items that are "really close" — or none. So you don't really know what K to pick. But you might, for example, know that "really close" means a dot_product() score of 0.95 or higher. In that case, you can use a query like this:

select id, txt, vec <*> @query_vec as score
from t
where score > 0.95;

The score > 0.95 clause in this query can be evaluated with an ANN index if there is one on the vec column. The system expands its search through the ANN index until scores drop under 0.95, then stops.

You can confirm that the ANN index is being used for range search by locating notation like this in the EXPLAIN or PROFILE plan:

INTERNAL_VECTOR_RANGE_SEARCH(0, @query_vec, 0.95) index
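
For completeness, here's a minimal sketch of the kind of setup the range-search query above assumes: a table with a VECTOR column and a vector (ANN) index, plus a session variable holding the query vector. The table definition, index options and three-dimensional vector below are illustrative only; adjust them for your schema and consult the documentation for the exact index options you need.

-- Illustrative setup; this t is separate from the full-text table t above.
create table t (
  id int,
  txt text,
  vec vector(3),
  sort key(id),
  vector index ivf (vec) index_options '{"index_type":"IVF_FLAT", "metric_type":"DOT_PRODUCT"}'
);

-- Cast a JSON array literal to a query vector.
set @query_vec = ('[0.12, 0.47, 0.88]') :> vector(3);

-- Prefix the query with EXPLAIN and look for the INTERNAL_VECTOR_RANGE_SEARCH notation.
explain select id, txt, vec <*> @query_vec as score
from t
where score > 0.95;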

We've also added support for filtered ANN search, which is really useful for top-K queries with filters. It allows you to easily get exactly K answers, even when you have filters in a WHERE clause. For example, suppose you have a query like this:

select id, v <*> @qv as score
from t
where x = "blue"
order by score desc
limit 5;

Suppose that x = "blue" is moderately selective, so that only 10% of the rows qualify. In SingleStore 8.5, with default settings, you might run this query and get fewer than five rows back — even if five rows exist that match the filter. In SingleStore 8.7, when using filters with top-K ANN queries, the system automatically expands its search through the ANN index until it gets enough rows to satisfy the LIMIT K clause.

Putting it all together for modern, intelligent apps with gen AI

Developers have ambitious goals for their apps today — they want to combine the best of transaction processing, analytics, search and gen AI to create a great end-user experience. Why should they have to suffer through the cost and complexity of using multiple data stores, or settle for less for their end users? We say they shouldn't have to do either.

Modern, intelligent apps have:

  • Forms to fill out and buttons to click to run transactions
  • Analytics components (like showing trends and summaries)
  • Search components (think the search bar in a shopping tool)
  • Gen AI components (like "summarize feedback about a product")

SingleStore lets you build apps like this at any scale, all in one system. No need to wheel in PostgreSQL, copy data to Elasticsearch and Snowflake, and bolt on Pinecone for semantic search and RAG. Save money and time, and don't pull your hair out. If you need transactions, analytics, search and gen AI — in other words, if you need to build an intelligent app — you need SingleStore.

Start free today.

