Full-text search (FTS) is a widely used feature that companies implement to filter and search through text data.
It’s been a core requirement for countless applications and workloads, and represents half of the hybrid search equation — something essential in gen AI, eCommerce and fintech contexts. Underneath it all, FTS still does what it did when IBM first pulled it off in 1969: find matches based on a search string. SingleStore’s full-text version 2 takes that idea and modernizes it to power any search you throw at it.
SingleStore has offered full-text search support since 2018, but we’ve significantly improved our capabilities in version 2 by leveraging the robust features of the Lucene library, open-source software under the Apache umbrella and a strong contender for defining the industry standard for FTS. The library is written entirely in Java, but is used in SingleStore via the friendly and familiar SQL syntax — potentially saving you and your engineers a substantial learning curve.
If your application database is already in SingleStore, this enhanced full-text search capability allows you to perform searches directly over that data. This eliminates the need for a separate product solely for search purposes, avoiding additional fees, data transfer costs and potential failure modes associated with maintaining multiple systems.
Under the hood
The quality of a full-text search engine is defined largely by two factors: speed and accuracy. Those ever-important top results must be the best results, and they must be fast. SingleStore achieves this speed through the use of an inverted index, which records each unique search term and where it occurs in a table. This indexing is done during data ingestion, minimizing overhead when searching through massive datasets later on.
The accuracy portion of the engine is another beast entirely. Each textual dataset will of course have different terms within it, so a robust universal scoring scheme is not possible. SingleStore’s full-text version 1 utilized a scoring scheme called the “Term Frequency-Inverse Document Frequency” (TF-IDF) system. Term frequency refers to how often a certain term occurs, whereas inverse document frequency refers to a term's rarity with respect to the rest of the text set.
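In one common formulation (there are several variants in use), these two signals are multiplied together to weight a term t in a document D drawn from a collection of N documents:

$$\mathrm{tfidf}(t, D) = f(t, D) \cdot \log\frac{N}{n_t}$$

Here f(t, D) is how often t appears in D and n_t is the number of documents containing t. Rare terms earn a larger IDF weight, so matching them contributes more to the score.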
Consider the following scenario: Two documents are returned in your search for elephants. One mentions elephants 200 times, whereas the other mentions elephants 100 times. Both are clearly valid results for your search, but for documents of similar length, the one with twice as many mentions isn’t twice as relevant.
After augmenting the old system with considerations for normalized document lengths and term frequency saturation, you are left with BM25 scoring, which is sought after by knowledgeable full-text search users. BM25 refers to “Best Match 25” and builds on the foundation of TF-IDF scoring, making it a more robust and favorable algorithm. Without getting our hands too filthy, let’s discuss how a single BM25 score is calculated:
- Document length. Length of the document in words, |D|
- Term frequency. The frequency of the query term qi across a document D, f(qi, D)
- Average document length. The average word length of a document within the given dataset, avgdl
- Inverse document frequency. The weight of the query term qi, also called its rarity. Denoted below as IDF(qi)
- Free parameters. k1 and b are free parameters within the BM25 scoring scheme, and influence the normalization calculations. Typical values are k1 ∈ [1.2,2.0], b = 0.75.
Here is an example calculation of a single BM25 score. Take a deep dive on the logic, variable definitions and possible tuning here.
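Putting those pieces together, the commonly cited BM25 formula (using the variable definitions above and summing over the query terms q_1 through q_n) is:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

The |D| / avgdl ratio is what normalizes for document length, and the k_1 term in the denominator is what saturates raw term frequency so that 200 mentions of elephants don’t count for twice as much as 100.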
General use cases
Now that we know a little bit about how full-text search works, let’s dive into some great uses for it.
Log searching
While user-facing applications are leaning toward the semantic and hybrid searches that may include vector similarity search, log information remains stored in basic text formats. This makes agile full-text search a massively impactful tool for debugging and log analysis. Whether centralized or heavily distributed, SingleStore’s full-text engine can search petabytes of logs — accurately — in under a second.
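As a rough sketch of what that looks like in practice (the app_logs table and its columns are hypothetical, and the query mirrors the full-text syntax shown in the examples later in this post), a log table with a VERSION 2 full-text index might be searched like this:

CREATE TABLE app_logs (
  id bigint,
  logged_at datetime,
  message text,
  FULLTEXT USING VERSION 2 log_idx (message)
);

-- Fuzzy search for 'timeout' (and near-misses like 'timeouts') in log messages
SELECT id, logged_at, message
FROM app_logs
WHERE MATCH (TABLE app_logs) AGAINST ('message:timeout~');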
JSON searches
With full-text version 2, customers can search entire columns of JSON documents as if they were laid out in paragraph form. SingleStore’s multi-model tables support the JSON datatype, which leads to the rarer capability of full-text search over semi-structured data. JSON documents can now be stored, queried, searched, analyzed and contextualized within one database. Not to brag, but this feat normally requires bolting three or more heavyweight components onto a company’s tech stack.
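Here is a minimal sketch of the idea, assuming (per the above) that a VERSION 2 full-text index can be declared over a JSON column; the events table and payload column are hypothetical names:

CREATE TABLE events (
  id int,
  payload json,
  FULLTEXT USING VERSION 2 payload_idx (payload)
);

-- Search the JSON documents as if they were paragraphs of text
SELECT id, payload
FROM events
WHERE MATCH (TABLE events) AGAINST ('payload:refund~');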
Fintech + martech
Full-text search can be a powerful tool in a fintech or martech stack, providing significant benefits for both financial services and marketing applications. An advanced full-text search engine across a company’s knowledge base, FAQs or product information could alleviate a great deal of potential stress for customers. A solid user experience is of course a boon in its own right, but it also leads to reduced support costs, better-informed customers and the potential for increased sales.
Types of search
Choosing the right type of full-text search is very important when it comes to performance and accuracy. Each type is equipped to handle different searches efficiently, and their results differ greatly. The following is a list of the main types of search that are shipping with full-text version 2.
Fuzzy search
Fuzzy searching is perhaps the most capable search type. At its core, it allows you to scan your data for anything close to your search key. It has three main parameters: number of edits, prefix length and transpositions. After rigorous testing, we determined the following default values would reduce the need for excessive query tuning:
- Number of edits: 2
- Prefix length: 0
- Transpositions: True
What do these mean?
- The number of edits refers to the number of characters that can change while still maintaining a valid match. For example, a fuzzy search on the key ‘develop’ would return rows with the word ‘developer’, as only two changes occurred. Increasing the value to 3 would then catch ‘developers’, which would have been omitted previously (see the sketch after this list).
- The prefix length refers to how many letters have to remain unchanged at the beginning of the word. Change this parameter to 1, and searching for ‘book’ will no longer return entries with the word ‘cook’, since ‘cook’ doesn’t start with a ‘b’.
- Transpositions (swapped adjacent characters) are common across human-entered data. This flag aids in catching those typos, and can be disabled at the query level within the OPTIONS clause.
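To make the ‘develop’ example above concrete, here is a small sketch using the default settings; the docs table and body column are hypothetical, and the ‘~’ fuzzy operator is the same one used in the examples later in this post:

-- With the default of two edits, 'develop~' also matches 'developer'
SELECT id, body
FROM docs
WHERE MATCH (TABLE docs) AGAINST ('body:develop~');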
Proximity search
Your use cases might require a search for words that are commonly found near each other, rather than immediately adjacent. This is a key feature in SingleStore’s FTS: it scans for both strings within a full-text column and checks to ensure they occur within a specified distance of one another.
Range search
A range search is the best tool to scan a full-text column for numerical values. To scan columns where numerical values are intertwined with strings, including the lexicographic syntax will help isolate the numbers in question. Pro tip: this is especially helpful for isolating dated events across massive amounts of data. Of course, the typical number search across text, PDFs and JSON data is fully supported and just as swift.
Regex search
Perhaps a search term of a certain structure suits your needs. Generating a regular expression for that search can be done and injected into a SingleStore query with ease. For example, the regex (allow|follow)s? would return rows containing the words ‘allow’, ‘allows’, ‘follow’ or ‘follows’. This can be seen as a more targeted fuzzy search, with more complex regular expressions casting nets of your own design.
Additional tuning features like term boosting, wildcard characters, JSON support and engine variables are discussed at length in our docs.
Examples
A proper full-text search demonstration would require a reasonably large and diverse dataset that a blog post can’t provide. Still, let’s dive into a few examples to help demystify some of the syntactic intricacies.
The workflow will be as follows:
- Create a table with a full-text index
- Demonstrate how to drop and recreate a full-text index on an existing table
- Insert data into the table
- Perform a full-text search on the data
Table creation
CREATE TABLE fts_test (
  id int,
  fname varchar(25),
  paragraph text,
  FULLTEXT USING VERSION 2 fts_index (fname, paragraph)
);
Index adding and dropping
Let’s say you have an old full-text version 1 index, or added another column that needs to be included in your full-text searches. Not to worry.
DROP INDEX fts_index ON fts_test;
ALTER TABLE fts_test ADD FULLTEXT USING VERSION 2 fts_index (paragraph);
Data insertion
INSERT INTO fts_test (id, fname, paragraph) VALUES
  (1, 'Taylor', 'I had no idea SingleStore could search like this!'),
  (2, 'Thomas', 'Fast, high-quality searches are crucial to my application.'),
  (3, 'Kelsi', 'Fulltext search really has come a long way since the ''60s.');
Fuzzy search
SELECT id, fname, paragraph
FROM fts_test
WHERE MATCH (TABLE fts_test) AGAINST ('paragraph:search~');
Given what we now know about full-text version 2, we would correctly expect all three rows to be returned. The word ‘searches’ is within the default two edits of the key ‘search’, and is therefore a valid match.
Proximity search
SELECT id, fname, paragraph
FROM fts_test
WHERE MATCH (TABLE fts_test) AGAINST ('paragraph:"search since"~6');
Since ‘search’ and ‘since’ occur within six tokens of each other in the third entry, this query returns the third row.
Range search
SELECT id, fname, paragraph
FROM fts_test
WHERE MATCH (TABLE fts_test) AGAINST ('paragraph:[$50$ TO $100$]');
This lexicographic range search would correctly return the third row as it contains the number 60. Omitting the ‘$’ wrapping would utilize a normal range search, which looks for integers and not strings — returning no rows.
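For comparison, here is the same query with the ‘$’ wrapping omitted, which runs the normal numeric range search described above and returns no rows against this sample data:

SELECT id, fname, paragraph
FROM fts_test
WHERE MATCH (TABLE fts_test) AGAINST ('paragraph:[50 TO 100]');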
Regex search
SELECT id, fname, paragraph
FROM fts_test
WHERE MATCH (TABLE fts_test) AGAINST ('paragraph:/search(es|ed)?/');
This regular expression search would capture any rows with the words ‘search’, ‘searched’ or ‘searches’.
BM25 scoring
SELECT id, fname, paragraph,
  ROUND(BM25(fts_test, 'paragraph:search'), 3) AS score
FROM fts_test
WHERE MATCH (TABLE fts_test) AGAINST ('paragraph:sea*')
ORDER BY score DESC;
This query is designed to demonstrate the power of both wildcards and BM25 scoring. In this case, the ‘*’ wildcard matches any word beginning with the prefix ‘sea’, and the score column orders the results by how relevant each row is to the word ‘search’ under BM25.
Try full-text search with SingleStore
SingleStore's full-text search version 2 represents a huge increase in search capabilities, offering powerful search mechanisms like fuzzy search, proximity search, range search and regex search. By leveraging the Lucene library and implementing the BM25 scoring algorithm, SingleStore has created a fast and accurate search solution that handles diverse use cases from log analysis to full JSON document searches.
This enhanced full-text search functionality allows organizations to perform complex searches directly on their data within SingleStore, eliminating the need for separate search products and simplifying their tech stack. With its SQL-based syntax and integration with SingleStore's multi-model database, full-text search version 2 provides a user-friendly — yet powerful — tool for developers and data professionals alike.
As data volumes continue to grow and search requirements become more sophisticated, SingleStore positions itself as a versatile and efficient solution for modern data-driven applications.