Vector Data Type Support in SingleStore

12 min read

Jan 19, 2024

Dive into a new vector type with SingleStore 8.5, designed to make building generative AI and RAG-based applications simpler than ever before.

There's been a tremendous surge in interest in vector databases and vector similarity search since ChatGPT was introduced in the fall of 2022. SingleStore is working to make our database the ideal data platform for semantic search and generative AI solutions to meet the requirements of today's application developers.

This includes the addition of support for both indexed approximate-nearest-neighbor (ANN) search for improved vector similarity search speed, and a new vector data type that makes it easier to build and maintain vector-based applications.

In this blog, we describe the new vector data type. We like what the developers of the PostgreSQL extension PGVector, another SQL system with vector support, did with respect to the type name and infix operators for vector similarity. So where possible, we've decided to stick to the same functional design. We think it's beneficial for people to use a SQL database instead of a specialty vector database (SVDB) when building AI and Retrieval Augmented Generation (RAG) applications. It'll help adoption of SQL for vector search if different SQL implementations use similar DDL and DML syntax.

Keep an eye out for an upcoming companion blog on ANN search for details on vector search speedup, plus more details on how people use vector similarity search to do RAG. RAG can dramatically improve AI chatbot result quality for specific knowledge areas.

What did we have before the vector type?

SingleStore has supported vector operations since 2017, initially storing vectors within a BLOB column. To convert vectors represented as JSON arrays into a binary format, we employed the JSON_ARRAY_PACK built-in function. The reverse operation was accomplished using JSON_ARRAY_UNPACK. For instance:

CREATE TABLE data(vector BLOB);
INSERT INTO data VALUES(JSON_ARRAY_PACK('[1,2,3]'));
INSERT INTO data VALUES(JSON_ARRAY_PACK('[2,3,4]'));
SELECT vector, HEX(vector), JSON_ARRAY_UNPACK(vector) FROM data;
+--------------+--------------------------+---------------------------+
| vector       | HEX(vector)              | JSON_ARRAY_UNPACK(vector) |
+--------------+--------------------------+---------------------------+
|   �?   @  @@  | 0000803F0000004000004040 | [1,2,3]                   |
|    @  @@  �@  | 000000400000404000008040 | [2,3,4]                   |
+--------------+--------------------------+---------------------------+

-- Example of using SCALAR_VECTOR_MUL built-in.
SELECT JSON_ARRAY_UNPACK(SCALAR_VECTOR_MUL(2, vector)) FROM data;
+-------------------------------------------------+
| JSON_ARRAY_UNPACK(SCALAR_VECTOR_MUL(2, vector)) |
+-------------------------------------------------+
| [2,4,6]                                         |
| [4,6,8]                                         |
+-------------------------------------------------+

SELECT DOT_PRODUCT(vector, JSON_ARRAY_PACK('[1,2,3]')) FROM data;
+-------------------------------------------------+
| DOT_PRODUCT(vector, JSON_ARRAY_PACK('[1,2,3]')) |
+-------------------------------------------------+
|                                              14 |
|                                              20 |
+-------------------------------------------------+
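The packed format shown in the HEX() output above is simply each element encoded as a little-endian 32-bit float. A minimal Python sketch (illustrative only, not SingleStore code) reproduces the hex values in the table:

```python
import json
import struct

def pack_vector(json_text):
    # Mirrors JSON_ARRAY_PACK: encode each number as a little-endian float32.
    nums = json.loads(json_text)
    return struct.pack("<%df" % len(nums), *nums)

def unpack_vector(blob):
    # Mirrors JSON_ARRAY_UNPACK: decode little-endian float32s back to a list.
    return list(struct.unpack("<%df" % (len(blob) // 4), blob))

packed = pack_vector("[1,2,3]")
print(packed.hex().upper())   # 0000803F0000004000004040, matching HEX(vector) above
print(unpack_vector(packed))  # [1.0, 2.0, 3.0]
```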

Problems with the BLOB data type

Several challenges arose when storing vectors using the BLOB data type. The fundamental concern was the lack of assurance that the data stored in the BLOB column was genuinely valid:

  • Arbitrary binary data. The BLOB column could house any binary data, leading to potential inclusion of erroneous information
  • Lack of data validation. There was insufficient data validation for vectors. It was possible to mistakenly insert a vector with a different size than expected, or one containing numbers incompatible with the vector's element type (e.g., NaN/INF for floating-point element types, or an out-of-range value like 1234 for the I8 element type)
  • Inserting data into the column. This required the use of an additional JSON_ARRAY_PACK or UNHEX built-in function call, which is a bit verbose and unintuitive.

The VECTOR built-in similarity functions, such as DOT_PRODUCT, check for size matches and throw an error on mismatch. However, detecting these errors earlier, during data ingestion into the table, would be preferable.

Additionally, selecting element types through suffixes on built-ins, as in JSON_ARRAY_PACK_I16('[1,2,3]'), presented another challenge. If you later decided to change the element type from I16 to F32, you had to find every such built-in call in your application queries and update its suffix to the new type.

Introducing a new VECTOR data type

Starting with SingleStore release 8.5, we have a new VECTOR data type that lets you insert, load and query vector data more easily. Here's the syntax for the type specification:

VECTOR(<N> [, <elementType>])

<N> is the number of elements (dimensions).

<elementType>, which is optional, is the vector element type; the default is F32.

Note: In SingleStore 8.5 we support only F32 — support for other element types will come later. Here's an example of how to use this new type in a table schema:

CREATE TABLE data(embedding VECTOR(3));
/* add some vectors of length 1 */
INSERT INTO data VALUES('[0.267261237,0.534522474,0.801783681]');
INSERT INTO data VALUES('[0.371390671,0.557085991,0.742781341]');
INSERT INTO data VALUES('[0.424264073,0.565685451,0.707106829]');
SELECT embedding FROM data;
+---------------------------------------+
| embedding                             |
+---------------------------------------+
| [0.267261237,0.534522474,0.801783681] |
| [0.424264073,0.565685451,0.707106829] |
| [0.371390671,0.557085991,0.742781341] |
+---------------------------------------+

In this example, we are introducing the VECTOR data type for storing arrays of numbers of the same size. The embedding column is defined as a VECTOR with three elements (dimensions) of type F32 (floating-point numbers), which is used by default.

Notice that the INSERT operation didn't need to use JSON_ARRAY_PACK. The JSON arrays of numbers inserted into the vector column were implicitly converted to packed arrays of F32 values. We also didn’t have to use JSON_ARRAY_UNPACK to get the data back out in human-readable form. Just like with INSERT, this automatic conversion to packed vector format works with PIPELINES and LOAD DATA.

With the implementation of the VECTOR data type, data validation becomes possible during the ingestion process:

> INSERT INTO data VALUES('[0,1]');
ERROR 2856 (HY000): Invalid VECTOR value for column 'embedding'

> INSERT INTO data VALUES('[0,0.3,4avdsds]');
ERROR 2856 (HY000): Invalid VECTOR value for column 'embedding'

> INSERT INTO data VALUES('[0.2,0.3,0.4,0.5]');
ERROR 2856 (HY000): Invalid VECTOR value for column 'embedding'
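The checks behind those errors can be sketched in Python — an illustration of the validation rules (dimension count, parseability, finite numbers), not SingleStore's implementation:

```python
import json
import math

def validate_vector(text, dims):
    """Accept a JSON array of `dims` finite numbers; raise ValueError otherwise."""
    try:
        values = json.loads(text)
    except json.JSONDecodeError:
        # e.g. '[0,0.3,4avdsds]' is not valid JSON
        raise ValueError("Invalid VECTOR value")
    if not isinstance(values, list) or len(values) != dims:
        # e.g. '[0,1]' or '[0.2,0.3,0.4,0.5]' has the wrong dimension count
        raise ValueError("Invalid VECTOR value")
    if not all(isinstance(v, (int, float)) and math.isfinite(v) for v in values):
        # NaN/INF are rejected for floating-point element types
        raise ValueError("Invalid VECTOR value")
    return [float(v) for v in values]
```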

Infix operators for DOT_PRODUCT and EUCLIDEAN_DISTANCE

We added the infix operator <*> for DOT_PRODUCT and the <-> operator for EUCLIDEAN_DISTANCE to allow for a cleaner query syntax. These are equivalent to the existing built-in functions, but instead of calling them like DOT_PRODUCT(a,b) you can use a <*> b. For example, suppose we have a table data2 just like data but it uses a blob column instead of a VECTOR column.  To do a "top 2 closest matches" search, instead of:

SET @v = JSON_ARRAY_PACK('[0.267261237,0.534522474,0.801783681]');

SELECT JSON_ARRAY_UNPACK(embedding), DOT_PRODUCT(embedding, @v) AS s
FROM data2
ORDER BY s DESC
LIMIT 2;

you can now do:

SET @v = '[0.267261237,0.534522474,0.801783681]' :> VECTOR(3);
SELECT embedding, embedding <*> @v AS s
FROM data
ORDER BY s DESC
LIMIT 2;
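The ORDER BY s DESC LIMIT 2 pattern is an exact top-k scan: score every row, sort descending, keep two. The same ranking in plain Python, using the three example vectors from the table above:

```python
def dot(a, b):
    # DOT_PRODUCT equivalent for two equal-length vectors.
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

rows = [
    [0.267261237, 0.534522474, 0.801783681],
    [0.371390671, 0.557085991, 0.742781341],
    [0.424264073, 0.565685451, 0.707106829],
]
query = [0.267261237, 0.534522474, 0.801783681]

# ORDER BY s DESC LIMIT 2: rank by dot product, keep the top two matches.
top2 = sorted(rows, key=lambda r: dot(r, query), reverse=True)[:2]
```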

We didn't add the <#> operator like the one PGVector uses for cosine distance because we historically asked people to use dot_product on vectors of length 1. This is equivalent to cosine similarity — not cosine distance — and our customers are used to that.

Cosine similarity ranges from -1 to 1, with 1 being the closest, while cosine distance ranges from 0 to 2, with 0 being the closest. Cosine_distance = 1 - cosine_similarity, so the two sort in opposite order: you need to use ORDER BY … DESC with cosine similarity to find the top K matches.
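That relationship is easy to check numerically. A short Python sketch (not SingleStore code); note that for vectors already normalized to length 1, the cosine similarity is just the dot product:

```python
import math

def cosine_similarity(a, b):
    # General form: dot(a, b) / (|a| * |b|).
    # For unit-length vectors the denominator is 1, so this reduces to DOT_PRODUCT.
    d = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return d / (na * nb)

a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]
sim = cosine_similarity(a, b)
dist = 1 - sim  # cosine distance

# Similarity near 1 means close (sort DESC for top-k);
# the equivalent distance is near 0 (it would sort ASC).
```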

Also, the # in that PGVector syntax doesn't work with the MySQL-compatible command line tools people use to connect to SingleStore.

Easy vector output

Retrieving vectors is also easier with the VECTOR type. When a column or expression is of the VECTOR data type, there is no need to use JSON_ARRAY_UNPACK, including with built-ins that return vectors:

SELECT embedding FROM data;
+---------------------------------------+
| embedding                             |
+---------------------------------------+
| [0.371390671,0.557085991,0.742781341] |
| [0.424264073,0.565685451,0.707106829] |
| [0.267261237,0.534522474,0.801783681] |
+---------------------------------------+

Transition from BLOB to VECTOR data type

For a ROWSTORE table, you can effortlessly change the column type from BLOB to VECTOR using the ALTER TABLE MODIFY statement:

DROP TABLE IF EXISTS data;
CREATE ROWSTORE TABLE data(embedding BLOB);
INSERT INTO data VALUES(JSON_ARRAY_PACK('[1,2,3]'));
INSERT INTO data VALUES(JSON_ARRAY_PACK('[2,3,4]'));
SELECT embedding, JSON_ARRAY_UNPACK(embedding) FROM data;
+--------------+------------------------------+
| embedding    | JSON_ARRAY_UNPACK(embedding) |
+--------------+------------------------------+
|   ??   @  @@ | [1,2,3]                      |
|    @  @@  ?@ | [2,3,4]                      |
+--------------+------------------------------+

ALTER TABLE data MODIFY embedding VECTOR(3);
SHOW COLUMNS FROM data;
+-----------+----------------+------+------+---------+-------+
| Field     | Type           | Null | Key  | Default | Extra |
+-----------+----------------+------+------+---------+-------+
| embedding | vector(3, F32) | YES  |      | NULL    |       |
+-----------+----------------+------+------+---------+-------+
SELECT embedding FROM data;
+-----------+
| embedding |
+-----------+
| [1,2,3]   |
| [2,3,4]   |
+-----------+

ALTER TABLE MODIFY is allowed only for ROWSTORE tables. To change the type of a column in a COLUMNSTORE table, we need to use a separate column. You can do this like so:

  • Add a VECTOR column to the right of the BLOB column
  • Update the VECTOR column with data from the BLOB column
  • Drop the BLOB column
  • Rename the VECTOR column to the old column name

This way, your queries will continue to work unchanged — or at least with fewer changes. Here's an example:

DROP TABLE IF EXISTS t;
CREATE TABLE t(c1 int, embedding blob, c3 int);
DESC t;

+-----------+---------+------+------+---------+-------+
| Field     | Type    | Null | Key  | Default | Extra |
+-----------+---------+------+------+---------+-------+
| c1        | int(11) | YES  |      | NULL    |       |
| embedding | blob    | YES  |      | NULL    |       |
| c3        | int(11) | YES  |      | NULL    |       |
+-----------+---------+------+------+---------+-------+

INSERT t values(1,JSON_ARRAY_PACK("[1,2,3]"), 100);
INSERT t values(2,JSON_ARRAY_PACK("[2,3,4]"), 200);
ALTER TABLE t ADD COLUMN emb2 vector(3) AFTER embedding;
UPDATE t SET emb2 = embedding;

Now, we have the data moved over to the new column:

SELECT c1, json_array_unpack(embedding), emb2, c3 FROM t;
+------+------------------------------+---------+------+
| c1   | json_array_unpack(embedding) | emb2    | c3   |
+------+------------------------------+---------+------+
|    1 | [1,2,3]                      | [1,2,3] |  100 |
|    2 | [2,3,4]                      | [2,3,4] |  200 |
+------+------------------------------+---------+------+

To get to our final destination, we have to drop the embedding column and rename emb2 to embedding:

ALTER TABLE t DROP COLUMN embedding;
ALTER TABLE t CHANGE emb2 embedding;
DESC t;

+-----------+----------------+------+------+---------+-------+
| Field     | Type           | Null | Key  | Default | Extra |
+-----------+----------------+------+------+---------+-------+
| c1        | int(11)        | YES  |      | NULL    |       |
| embedding | vector(3, F32) | YES  |      | NULL    |       |
| c3        | int(11)        | YES  |      | NULL    |       |
+-----------+----------------+------+------+---------+-------+

The embedding column is in the middle, right where we want it. Everything is the same except the embedding column is of type vector. Along with the other benefits we get from the VECTOR column type, you can also now create an ANN index on this column.

Powerful casting between BLOB and VECTOR

To minimize the need to change your existing applications — including making it easy to use existing SQL queries, PSQL functions and built-in functions originally written for the BLOB type with VECTOR type data — we've implemented easy casting from BLOB to VECTOR and from VECTOR to BLOB. For example, suppose you had this query in an application:

SELECT embedding, embedding <*> UNHEX("0000803F0000803F0000803F")
FROM data;

The UNHEX function is sometimes used to allow an application to pass an array from a programming language like C++ to the database by converting the array to hex, then passing that as a string into a SQL statement. In this query, the expression:

embedding <*> UNHEX("0000803F0000803F0000803F")

has a VECTOR type value on the left and a BLOB type value on the right. The blob is automatically cast to a vector of the appropriate length to match the length of the vector on the left. If the length is not correct, an error will be generated.

The SELECT statement generates this:

+-----------+-------------------------------------------------+
| embedding | embedding <*> UNHEX("0000803F0000803F0000803F") |
+-----------+-------------------------------------------------+
| [1,2,3]   |                                               6 |
| [2,3,4]   |                                               9 |
+-----------+-------------------------------------------------+
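On the application side, the argument to UNHEX is just the little-endian float32 bytes of the query vector, rendered as hex. A Python sketch (illustrative, not SingleStore code) builds that string and checks the dot products in the table above:

```python
import struct

def to_hex_f32(values):
    # Build the hex string an application would pass into UNHEX(...).
    return struct.pack("<%df" % len(values), *values).hex().upper()

print(to_hex_f32([1.0, 1.0, 1.0]))  # 0000803F0000803F0000803F

def dot(a, b):
    # What the <*> operator computes once both sides are vectors.
    return sum(x * y for x, y in zip(a, b))

# Matches the SELECT output: [1,2,3]·[1,1,1] = 6 and [2,3,4]·[1,1,1] = 9.
```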

Here's a quick quiz. What is the floating point number with hex value 0000803F? You guessed it, 1! The cast in the expression just shown is automatic. You can also cast in both directions explicitly with the :> operator. For example:

SELECT "[1,2,3]" :> vector(3) :> blob :> vector(3);
+---------------------------------------------+
| "[1,2,3]" :> vector(3) :> blob :> vector(3) |
+---------------------------------------------+
| [1,2,3]                                     |
+---------------------------------------------+

You can see the original vector survives the round trip through casts to vector, blob and back to vector. Casting also comes in handy when writing UDFs that need to operate on vectors of any length. For example, our documentation gives a PSQL function:

   normalize(v blob) returns blob

that normalizes a vector with an arbitrary number of elements to be length 1. You can use this function on vectors of a fixed length like so:

set @v = "[1,2,3]" :> vector(3);
select normalize(@v) :> vector(3);
+---------------------------------------+
| normalize(@v) :> vector(3)            |
+---------------------------------------+
| [0.267261237,0.534522474,0.801783681] |
+---------------------------------------+
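The math behind that normalize() function is straightforward: divide each element by the vector's Euclidean length. A Python equivalent (illustrative, not the documented PSQL body):

```python
import math

def normalize(v):
    # Scale a vector to unit (Euclidean) length, like the normalize() PSQL UDF.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

print(normalize([1.0, 2.0, 3.0]))
# ≈ [0.2672612, 0.5345225, 0.8017837] — the F32 values in the SELECT output
# above are the same numbers rounded to float32 precision.
```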

Similarly, you can use casting to write typed helper functions that wrap a blob-based function like normalize(). For example, this function normalizes a vector of length 1536 (the length of a vector from the OpenAI ada-002 model):

create function norm1536(v vector(1536)) returns vector(1536) as
begin
  return normalize(v);
end

Future plans for the VECTOR type

We plan to continue to enhance the vector data type, including how to:

  • Make it work for more element types (like I8...I64 and F64, F16 and BFLOAT16)
  • Overload the vector math functions like VECTOR_SUB to work with infix operators like "-"
  • Make it possible to write functions that take a type like VECTOR(*) — a vector of an arbitrary length.

Conclusion

We've brought vectors and vector ANN search to SingleStore, a modern distributed SQL DBMS. This is a great foundation for new vector-oriented applications including LLM-based AI apps. We think these new features make us a better platform for modern data and AI applications than SVDB products, fulfilling our prediction from last spring that we'd be a great vector database long before the SVDBs are great databases.

What do you want to build with the new vector type and ANN index search capabilities in SingleStoreDB 8.5? Get started today, and activate your free SingleStore trial.
