Vector databases are designed for efficient storage, retrieval and similarity search of high-dimensional vector data. Through a process called embedding, data such as text, images or audio is represented as vectors in a continuous, meaningful high-dimensional space, usually referred to as an embedding space.
In this article, we examine practical approaches for storing/retrieving vector data and performing similarity search, especially in light of generative AI applications. We will also highlight key capabilities where SingleStoreDB outshines other vector-capable databases.
Before we dive deeper, let’s outline the capabilities that make up a vector database:
Ability to perform similarity searches
When given a query vector, a vector database can retrieve the most similar vectors based on a specified similarity metric, such as cosine similarity or Euclidean distance. This allows applications to find relevant items or data points based on their similarity to a given query.
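To make the idea concrete, here is a minimal sketch of exact (brute-force) similarity search in plain Python. The function names and the tiny three-dimensional vectors are illustrative assumptions, not any database's API; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_similar(query, vectors, k):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda vid: cosine_similarity(query, vectors[vid]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, made up for illustration only.
items = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
nearest = top_k_similar([1.0, 0.0, 0.0], items, k=2)
```

Because every stored vector is compared with the query, this approach is exact but scales linearly with the number of vectors.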
Retrieve vector data with high performance
Vector databases often employ indexing techniques, typically Approximate Nearest Neighbor (ANN) algorithms (e.g., Locality-Sensitive Hashing or Product Quantization), to accelerate the search process. These indexing methods aim to reduce the computational complexity of searching in high-dimensional vector spaces, where traditional methods like spatial decomposition become impractical due to high dimensionality.
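As a rough illustration of how an ANN index narrows the search, the sketch below implements one simple form of Locality-Sensitive Hashing based on random hyperplanes. It is a teaching example under stated assumptions (the function names, the number of hyperplanes and the single hash table are all choices made here), not how any particular product builds its index.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes; each one contributes one bit of the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their signature."""
    buckets = defaultdict(list)
    for vid, vec in vectors.items():
        buckets[lsh_signature(vec, hyperplanes)].append(vid)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Only vectors in the query's bucket are compared exactly afterwards."""
    return buckets.get(lsh_signature(query, hyperplanes), [])
```

Vectors pointing in similar directions tend to share a signature, so a query is compared only against its own bucket rather than the whole dataset. The price is that a true neighbor can land in a different bucket and be missed, which is why these methods are called approximate.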
The landscape of vector databases
We look at five approaches for persisting and retrieving vector data:
- Pure vector databases like Pinecone
- Full text search databases like ElasticSearch
- Vector libraries like Faiss, Annoy and Hnswlib
- Vector-capable NoSQL databases like MongoDB, Cosmos DB and Cassandra
- Vector-capable SQL databases like SingleStoreDB or PostgreSQL
Apart from the five main approaches mentioned above, there are AI/ML platforms such as Vertex AI and Databricks whose capabilities go beyond databases. For this reason, I exclude them from this analysis.
In this already crowded and rapidly expanding landscape of vector databases, how do you weigh your options? Let’s discuss the advantages and limitations of each approach. I promise to be as objective as possible!
1. Pure Vector Databases
Pure vector databases are specifically designed to store and retrieve vectors. Examples include Chroma, LanceDB, Marqo, Milvus/Zilliz, Pinecone, Qdrant, Vald, Vespa and Weaviate.
In pure vector databases, data is organized and indexed based on the vector representation of objects or data points. These vectors can be numerical representations of various types of data including images, text documents, audio files or any other form of structured or unstructured data.
Advantages of pure vector databases
- Efficient similarity search with indexing techniques
- Scalability for large datasets and high query workloads
- Support high-dimensional data
- Support HTTP & JSON-based APIs
- Native support for vector operations including addition, subtraction, dot product, cosine similarity
Disadvantages of pure vector databases
- Vector-only: Pure vector databases can store vectors and some metadata, but little else. For most enterprise AI use cases, you may require including data such as descriptions of entities, properties and hierarchies (graph), location (geospatial), etc.
- Limited or no SQL support: Pure vector databases usually employ their own query language, making it hard to run traditional analytics on vectors and associated information — or combine vector and other data types.
- No full CRUD. Pure vector databases are not really designed for create, update and delete operations. For read operations, data must first be vectorized and indexed for persistence and retrieval. These databases focus on ingesting vector data, indexing it for efficient similarity search and querying for nearest neighbors based on vector similarity.
- Indexing is time consuming. Indexing vector data is computationally heavy, expensive and time consuming. This makes it hard to use fresh data for generative AI applications.
- Forced tradeoffs. Depending on the indexing technique used, vector databases require customers to make tradeoffs between accuracy, efficiency and storage. For instance, Pinecone’s IMI index (Inverted Multi-Index, a variant of ANN) creates storage overhead and is computationally intensive. It is primarily designed for static or semi-static datasets, and can struggle if vectors are frequently added, modified or removed. Milvus uses indexes called Product Quantization and Hierarchical Navigable Small World (HNSW), which are approximate techniques that trade off search accuracy for efficiency. Moreover, its indexing requires configuring various parameters, and poor parameter choices may degrade the quality of search results or introduce inefficiencies.
- Questionable enterprise features. Many vector databases lag sorely behind on basic features including ACID transactions, disaster recovery, RBAC, metadata filtering, database manageability, observability, etc. This can lead to serious business problems — similar to this customer who lost all their data.
For many customers, the limitations of vector databases will boil down to price performance. Given the compute-heavy nature of vector operations, OSS vector databases or vector libraries become viable alternatives, especially for large-scale applications.
2. Full-text search databases
This category includes databases such as Elastic/Lucene, OpenSearch and Solr.
Advantages
- High scalability and performance, especially for unstructured text documents
- Rich features for text retrieval such as built-in foreign language support, customizable tokenizers, stemmers, stop lists and N-grams
- Based on open-source library (Apache Lucene)
- Large ecosystem of integrations, including with vector libraries
Limitations of full-text search databases for vector data
- Not optimized for vector search or similarity matching
- Designed for full-text search, not semantic search, so applications built on them won’t have full context for Retrieval Augmented Generation (RAG) and other use cases. To achieve semantic search capabilities, these databases require augmentation with other tools, plus heavy custom scoring and relevance models.
- Limited applications for other data formats (images, audio, video)
- Lack GPU support
3. Vector libraries
For many developers, open-source vector libraries such as Faiss, Annoy and Hnswlib are a good place to start.
Faiss is a library for similarity search and clustering of dense vectors. Annoy (Approximate Nearest Neighbors Oh Yeah) is a lightweight library for ANN search. Hnswlib is a library that implements the HNSW algorithm for ANN search.
Advantages of open-source vector libraries
- Fast nearest neighbor search
- Built for high dimensionality
- Support ANN oriented index structures including inverted files, product quantization and random projection
- Support use cases for recommendation systems, image search and NLP
- SIMD (Single Instruction, Multiple Data) and GPU support to speed up vector similarity search operations
Limitations of open-source vector libraries
- Burdensome maintenance and integration
- Sacrifice search accuracy compared to exact methods
- Bring your own infrastructure. Vector libraries are memory and compute hungry, and they need you to build and maintain complex infrastructure to provision enough CPU, GPU and memory resources for application needs.
- Limited or no support for metadata filtering, SQL, CRUD operations, transactions, high availability, disaster recovery, and backup and restore
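The accuracy sacrificed by approximate methods is commonly quantified as recall: the share of true nearest neighbors that the approximate search also returns. The helper below is a minimal sketch with a hypothetical function name and made-up ids. It assumes you already have ranked result lists from an exact search and an approximate one, and that `k` is at least 1.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    true_top = set(exact_ids[:k])
    found = true_top.intersection(approx_ids[:k])
    return len(found) / len(true_top)

# Made-up result lists for illustration.
exact = ["a", "b", "c", "d"]
approx = ["a", "c", "x", "b"]
score = recall_at_k(approx, exact, k=3)  # two of the three true neighbors were found
```

Tracking this number while tuning index parameters makes the tradeoff between speed and accuracy visible instead of implicit.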
4. Vector-capable NoSQL databases
This category includes:
- NoSQL databases like MongoDB, Cassandra/DataStax Astra and Cosmos DB
- Key-value databases like Redis
- Other special purpose databases like Neo4j (graph)
Nearly all of these NoSQL databases have only recently become vector capable by adding extensions for vector search.
Advantages
- For their specific data models, NoSQL databases offer high performance and scale. Neo4j (a graph database) can be used in conjunction with LLMs for social networks or knowledge graphs. A vector-capable time-series database — like kdb — may be able to combine vector data with financial market data.
Limitations
- Vector capabilities of NoSQL databases are basic/nascent/untested. Many NoSQL databases added vector support just this year. In May, Cassandra announced plans to add vector search. In April, Rockset announced support for basic vector search, and Azure Cosmos DB announced vector search support for MongoDB vCore in May. DataStax and MongoDB announced vector search capabilities just this month (both in preview)!
- Vector search performance of NoSQL databases can vary widely, depending on the vector functions, indexing methods and hardware acceleration supported.
5. Vector-capable SQL databases
This category consists of a very small set of databases: SingleStoreDB, pgvector/Supabase Vector (beta) for PostgreSQL, ClickHouse, Kinetica and Rockset. We expect more popular databases to join this list, as it’s not a heavy lift to add basic vector capabilities to an established database. In fact, the vector database Chroma emerged from ClickHouse.
Update: In September 2023, Oracle announced vector search capabilities as well.
Advantages of vector-capable SQL databases
- Power vector search with functions such as dot product, cosine similarity, Euclidean distance and Manhattan distance.
- Use similarity scores to find K-Nearest neighbors
- Multi-model SQL databases offer hybrid search, and can combine vector with other data for more meaningful results
- Most SQL databases can be deployed as a service, fully managed on any major cloud.
Limitations of SQL databases for vector data processing
- SQL databases are designed for structured data. The corpora behind generative AI applications substantially comprise unstructured data such as images, audio and text. While relational databases can usually store text and blobs, most do not vectorize this unstructured data for use in machine learning.
- Most SQL databases are not (yet) optimized for vector search. The indexing and querying mechanisms of relational databases are primarily designed for structured data, rather than high-dimensional vector data. While the performance of SQL databases for vector data processing may not be exceptional, vector-capable SQL databases are likely to add extensions or new functionality to support vector search. For instance, while SingleStoreDB supports exact k-NN search, we intend to add ANN search to improve performance on very large, high-dimensional datasets.
- Traditional SQL databases do not scale out and as such, their performance degrades as data grows. Handling large datasets of high-dimensional vectors with SQL databases may require you to do additional optimizations, like partitioning the data or employing specialized indexing techniques to maintain efficient query performance.
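The partitioning idea mentioned in the last point can be sketched as a scatter-gather search: each partition finds its own best matches, and the partial results are merged into a global top-k. This is an illustrative sketch with hypothetical names and made-up data, not a description of how any specific database distributes queries.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def local_top_k(partition, query, k):
    """Each partition scores only its own rows and returns its k best (score, id) pairs."""
    scored = ((dot(query, vec), vid) for vid, vec in partition.items())
    return heapq.nlargest(k, scored)

def distributed_top_k(partitions, query, k):
    """Merge the per-partition winners into a global top-k."""
    partial = [hit for part in partitions for hit in local_top_k(part, query, k)]
    return [vid for _, vid in heapq.nlargest(k, partial)]

# Two made-up partitions holding two vectors each.
partitions = [
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
    {"c": [0.9, 0.1], "d": [0.5, 0.5]},
]
result = distributed_top_k(partitions, [1.0, 0.0], k=2)
```

Since every partition returns its own top k, the merged result is the same as a search over the full dataset, while the scoring work can run in parallel.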
SingleStoreDB: A Robust, Full-Context Vector Database
As discussed, each category of databases described has advantages and limitations. These databases (and others) may attempt to address limitations with extensions, toolkits and new features. The performance and usability of these extensions is yet to be seen or proven.
SingleStoreDB provides a simpler, more powerful approach to handling vector data. It allows you to store and query vector data alongside traditional structured data, providing a unified platform for various types of queries and analysis. As a distributed SQL database, SingleStoreDB is also highly performant, highly available and can scale out to adapt to growing data sets.
SingleStore has supported over a dozen vector functions since 2017! These include dot_product for cosine similarity, Euclidean distance, vector normalization and various vector arithmetic functions. SingleStore customers deploy vectors in production use cases — just a few of which include LiveRamp, Siemens, Lumix.ai, Thorn and Nyris. Use cases span semantic search, face matching, product catalog search and surveillance (see the resources section for details).
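A dot product yields cosine similarity when both vectors are first normalized to unit length. The short check below demonstrates that relationship in plain Python. These are ordinary helper functions written for this example, not the database's built-in functions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product(a, b) / (norm_a * norm_b)

a, b = [3.0, 4.0], [4.0, 3.0]
via_dot = dot_product(normalize(a), normalize(b))
via_cosine = cosine_similarity(a, b)
```

Both values agree up to floating-point rounding, so normalizing vectors once at load time lets a cheaper dot product stand in for cosine similarity at query time.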
Why SingleStore Is a Better Vector Database
Vector database type | Why SingleStore is better |
Pure vector databases (Pinecone) |
|
Full-text search databases (ElasticSearch) |
|
Vector libraries (Faiss) |
|
Vector-capable NoSQL databases (MongoDB) |
|
Vector-capable SQL databases (pgvector for PostgreSQL) |
|
Vector Database Use Cases with SingleStoreDB
SingleStoreDB features built-in exact nearest-neighbor vector similarity search. This is useful for a number of AI applications, including:
- Image and video processing. SingleStoreDB enables applications like reverse image search, content-based image retrieval, image classification and video similarity analysis.
- Natural language processing. With its support for keyword-based, full-text search and vector-based semantic search, SingleStoreDB enables:
- Text/document retrieval and similarity matching
- Generative AI on enterprise data including Q&A systems
- Recommendation engine. By finding the nearest neighbors based on user preferences or item attributes, you can use SingleStoreDB to build recommendation systems to suggest similar items to users, enhancing browsing or shopping experiences.
- Anomaly detection. Vector similarity search in SingleStoreDB can be used in anomaly detection systems to identify unusual or anomalous data points.
- Entity resolution. Vector similarity search in SingleStoreDB can identify similar data items describing an entity, such as a person, even without exact matches. By combining scores for comparisons of multiple properties of an entity, partial descriptions can be matched to an entity with high confidence.
See the resources section for more information on getting started with AI use cases.
Benefits of Using SingleStoreDB as a Vector Database
SingleStoreDB is simpler, less expensive and can be more powerful than vector-only, NoSQL or full-text search databases. SingleStoreDB can combine metadata, SQL, JSON and time-series data, and run aggregations, all in a single query. This opens up enterprise gen AI use cases where:
- Generated answers are based on both public and enterprise-owned corpora of data
- Answers are tailored based on the asker’s role (is the person asking an unverified user, customer, partner or employee?)
- Hallucinations are reduced by using Retrieval Augmented Generation (RAG)
These types of AI applications are impractical to achieve with other vector databases.
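The sketch below shows one way the role-aware RAG pattern described above could look: documents are first filtered by what the asker's role may see, then ranked by similarity, and the surviving passages are placed in the prompt. All names, fields and documents here are hypothetical, and the step that sends the prompt to a language model is left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_context(query_vec, documents, role, k=2):
    """Keep only documents the asker's role may see, then rank them by similarity."""
    visible = [d for d in documents if role in d["allowed_roles"]]
    ranked = sorted(visible, key=lambda d: dot(query_vec, d["embedding"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

def build_prompt(question, context):
    """Ground the model's answer in the retrieved passages."""
    passages = "\n".join(f"- {text}" for text in context)
    return f"Answer using only this context:\n{passages}\n\nQuestion: {question}"

# Hypothetical documents with made-up embeddings and access rules.
documents = [
    {"text": "Public pricing starts at $10.", "embedding": [0.9, 0.1],
     "allowed_roles": {"customer", "employee"}},
    {"text": "Internal margin is 40%.", "embedding": [0.8, 0.2],
     "allowed_roles": {"employee"}},
]
customer_context = retrieve_context([1.0, 0.0], documents, role="customer")
employee_context = retrieve_context([1.0, 0.0], documents, role="employee")
prompt = build_prompt("What does it cost?", customer_context)
```

Filtering before ranking means restricted passages never reach the prompt, so the model cannot leak them in its answer.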
Full text? Even better — full context
- Use all data relevant to your company. Combine vector data from text, images, audio, video, etc., with other kinds of data including logs, stock market data, clickstream and sensor data. This is possible because all kinds of structured and unstructured data can be co-located in SingleStore: vectors, text, SQL, JSON, time-series and geospatial data. Users can leverage a combination of vector and full-text search features.
- Connect and ingest data from other sources. SingleStoreDB supports a wide range of data sources and connectors, allowing users to ingest data from diverse systems including other databases, HDFS, message queues, log files, cloud storage (Amazon S3) and streaming data platforms like Confluent Kafka.
- Re-ranking semantic search results is made easy with ‘dot_product’ and ‘match’ support.
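One simple way to re-rank is to blend a keyword relevance score with a vector similarity score. The sketch below assumes both scores have already been computed and scaled to the range 0 to 1. The field names, the weighting scheme and the sample scores are assumptions made for illustration, not a built-in feature.

```python
def rerank(candidates, keyword_weight=0.3):
    """Re-order candidates by a weighted blend of keyword and vector scores."""
    def blended(c):
        return keyword_weight * c["match_score"] + (1 - keyword_weight) * c["vector_score"]
    return sorted(candidates, key=blended, reverse=True)

# Made-up scores for three hypothetical documents.
hits = [
    {"id": "A", "match_score": 0.2, "vector_score": 0.9},
    {"id": "B", "match_score": 0.9, "vector_score": 0.7},
    {"id": "C", "match_score": 0.0, "vector_score": 0.8},
]
order = [c["id"] for c in rerank(hits)]
```

Here document B overtakes A because its strong keyword match outweighs a slightly lower semantic score. The weight is a tuning knob to adjust per application.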
Rich query language
- SQL allows powerful metadata filtering, joins, aggregates, subqueries, window functions and other language features.
- SingleStoreDB can do fast K-Nearest-Neighbor search with ‘order by/limit k’ queries using ‘dot_product’ and ‘euclidean_distance’ metrics, combined with arbitrary SQL for metadata filtering.
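The 'order by/limit k' pattern can be tried out locally with Python's built-in sqlite3 module. Note the assumptions: SQLite has no native vector functions, so this sketch registers a Python function named `dot_product` as a stand-in and stores vectors as JSON text. The table and data are invented, and the SQL shows the general shape of such a query rather than SingleStoreDB's exact syntax.

```python
import json
import sqlite3

def dot_product(a_json, b_json):
    """Stand-in UDF: both arguments are JSON-encoded arrays of floats."""
    a, b = json.loads(a_json), json.loads(b_json)
    return sum(x * y for x, y in zip(a, b))

conn = sqlite3.connect(":memory:")
conn.create_function("dot_product", 2, dot_product)
conn.execute("CREATE TABLE products (name TEXT, category TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("red shoe", "footwear", json.dumps([0.9, 0.1])),
        ("blue shoe", "footwear", json.dumps([0.7, 0.3])),
        ("red hat", "headwear", json.dumps([0.95, 0.05])),
    ],
)

query = json.dumps([1.0, 0.0])
rows = conn.execute(
    """
    SELECT name, dot_product(embedding, ?) AS score
    FROM products
    WHERE category = ?      -- metadata filter
    ORDER BY score DESC     -- rank by similarity
    LIMIT 2                 -- k nearest neighbors
    """,
    (query, "footwear"),
).fetchall()
names = [name for name, _ in rows]
```

The hat has the highest raw similarity, but the metadata filter excludes it, so the query returns only the two shoes, ranked by score.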
Simpler than pure vector databases
- Deploy a vector database without the added complexity, licensing costs or extra training requirements of a pure vector database.
- Run on-premises and on any major cloud as a fully managed service
- Quickly prototype and deploy
- Get data security, compliance and disaster recovery fit for enterprise use cases
The author would like to thank Eric Hanson for his valuable contributions to this article.
Start Using SingleStoreDB as Your Vector Database
- For more information about SingleStoreDB as a vector database, see singlestore.com/built-in-vector-database and our documentation on Working with Vector Data.
- Contact us to book a consultation with an expert at SingleStore.
- Start a free trial here
Resources to get started with vector data/AI use cases on SingleStore
Generative AI
- How to Build a ChatGPT App on Your Own Data
- How to Use Large Language Models (LLMs) on Private Data: A Data Strategy Guide
- How to use SingleStore in a full-stack Chat GPT app
- Using OpenAI with SingleStoreDB to store and query vectors of fine food reviews
- Using ChatGPT for Questions Specific to Your Company Data
- Getting OpenAI Embeddings in SQL Using External Functions
- LangChain Lift-off: Launch Your Open Source GPT Apps Today
Image matching and classification
- Image Matching in SQL With SingleStoreDB
- Using SingleStore DB, Keras and TensorFlow for image classification
- Nyris.io uses SingleStoreDB for computer vision to identify products. See their product demo here.
Natural language processing
- Siemens builds AI-powered semantic search in SingleStoreDB for sentiment analysis on HR survey data
Recommendation engine
- Using SingleStoreDB, Spark and Alternating Least Squares (ALS) to build a Movie Recommender System
Other resources to help choose your Gen AI tech stack
- Selecting the Optimal Database for Generative AI
- Why Your Vector Database Should Not be a Vector Database
- Why You Shouldn’t Invest In Vector Databases
- DB-Engines ranking of vector DBMS
- Full-Text Search vs. Semantic Search: The Good, Bad and Ugly