Support mission-critical workloads with SingleStoreDB: Part 1
SingleStore is a high-performance, distributed SQL database that is designed to power your real-time applications. It offers ultra-fast ingestion (10+ million rows per second), millisecond-level response time for OLTP queries and sub-second latency for analytical queries on historical data. It also eliminates the need for multiple purpose-built data engines with its distributed architecture and patented universal storage engine.
This allows you to support both your transactional and analytical workloads, as well as diverse datasets, including relational, key-value, JSON and vector.
SingleStoreDB offers a wide range of mission-critical capabilities for enterprise data engineers and SaaS application developers. This two-part series will walk you through how to use SingleStoreDB to build your next mission-critical application. In part 1, we will focus on performance and build an OLTP database using examples. In part 2, we will delve into data integrity and fault-tolerance capabilities of SingleStoreDB including ACID transactions, concurrency control, high-availability and disaster recovery.
Your toolkit for faster apps
To kick things off, we will focus on performance. I will explain the key features SingleStoreDB offers to build a high-performance transactional application.
- Sharded tables. SingleStore divides databases into partitions (shards), which are horizontal slices of data distributed across the leaf nodes in your system. We utilize partitions as a unit of data distribution, enabling improved performance and resource utilization. To achieve enhanced performance and scalability, you can leverage sharding to evenly distribute the data within a table across these partitions.
By doing so, you can optimize your system for distributed query execution, resulting in better parallelism and load balancing across different nodes in the cluster. This approach allows you to achieve greater efficiency and performance gains.
In SingleStoreDB, the Universal Storage feature is an evolution of our columnstore table type that provides a flexible table format for supporting both transactional and analytical workloads. This can reduce your Total Cost of Ownership (TCO), and allow you to implement new application architectures for operational analytics that were not possible before. When data is initially ingested, it is stored in in-memory rowstore segments. The rowstore data is optimized for high-performance OLTP workloads by storing all fields for a given row together in the same physical location. This enables millisecond query latencies.
As data in a Universal Storage table grows, we migrate it from in-memory rowstore segments to on-disk (SSDs) storage using a columnstore format. In the columnstore format, data for each column segment (data for one column from a million-row chunk of data) are stored in the same blob on disk. This arrangement enables efficient data scans and compression, further enhancing performance and storage efficiency.
Rowstore tables. These tables are best for your operational and transactional workloads like point lookups, small inserts, updates and deletes. In SingleStoreDB, rowstore tables store data in memory to provide millisecond — or even sub-millisecond — response times. Rowstore tables are suitable when highest performance and real-time data access are critical, and cost to keep data in RAM is not a major concern. For situations where your data size exceeds available RAM, we recommend a combination of rowstore and columnstore tables, or just columnstore tables, to balance performance and memory requirements. Visit our documentation to learn more about rowstore tables.
- Indexes. Creating indexes is the best way to improve your database query performance for transactional workloads. SingleStoreDB offers various indexes for both your rowstore and columnstore tables, including:
- Skip list indexes. These are our default index type for rowstore tables. Skip list indexes are similar to B-tree indexes in that they keep data ordered by a key, offering faster insertion and query speeds by optimizing for in-memory performance. To learn more about skiplists, check out this blog.
- Hash index. This is another valuable tool to improve your query performance when used with equality lookups. For example, a hash index can be used to quickly find all rows in a table where a particular column has a specific value (e.g., WHERE customer_id = 12345). You can create a hash index in a table with a “USING HASH” clause. For example: CREATE TABLE hash_index_table (id INT, value INT, INDEX(id, value) USING HASH). Or, you can use CREATE INDEX or ALTER TABLE to create one after you create the table. For more details, head to our documentation).
- Full-text index. This index can be created on a columnstore table to search for words or phrases in a large body of text. Full-text indexes can only be created on char, varchar, text or longtext columns, and you can use either an exact or "fuzzy" search syntax. By creating full-text indexes on certain columns, you significantly improve the speed of queries that involve searching for specific words or phrases within those columns. Explore more examples of working with full-text search.
- Sub-segment seeking into columnstore. Sub-segment seeking is another valuable tool offered in SingleStoreDB to enhance query performance. It allows the query engine to directly seek into the segments containing the required values, rather than scanning an entire segment, which can contain up to one million rows.
For example, let's consider a products table with a hash index on product ID. If you want to retrieve all the rows with a particular product ID (WHERE product_id = 456), SingleStoreDB simply seeks directly into the segments for each column referenced in the query, retrieving the column values for just one or a few records that meet the WHERE clause, instead of scanning the entire segment. This approach provides up to 400x boost in query performance.
This sub-segment seeking into columnstore tables allows you to implement transactional applications that retrieve and update a few rows quickly — with high concurrency — on this disk-based table structure that is excellent for analytics. This is a unique feature of SingleStoreDB.
Column Group Index (CGI). This special index is designed to enhance the retrieval speed of rows in wide columnstore tables. In our testing, CGI increased the transactional throughput on wide tables by 6-8x. Our performance in seeking rows with tens of columns is already quite good, with response times in the single-digit millisecond range.
However, when dealing with tables that have hundreds of columns, CGI becomes essential in accelerating both row retrievals and updates. Accessing individual rows from columnar tables with many columns incurs a non-trivial amount of overhead, including CPU time, and when the data does not fit in the in-memory buffer cache, I/O.
The CGI creates an on-disk, row-oriented representation of the table data, residing alongside the column blobs. During query execution, the engine optimally combines the Column Group and the columnstore to deliver an efficient execution plan. For more information, explore documentation and a blog on the topic.
Let's build an OLTP database with examples
Let's put all of these concepts together to build an OLTP database. We'll use orders and line items tables in universal table format (columnstore) as an example. Universal Storage tables are a good choice when you have large tables that are bigger than RAM, while still ensuring that your database performs well to support your transactional workloads. They offer rowstore-like performance while lowering the cost for OLTP workloads.
SingleStoreDB also uses a partitioning strategy to optimize query processing. When creating column store tables, the default partitioning strategy is one partition per CPU core. However, you can configure the partitions as either 1:1 or 1:4 (partitions to CPU cores ratio), or some other ratio. You can do this at the time of creating a database using the PARTITIONS argument. For OLTP workloads or workloads that often run more parallelized queries, we recommend using more than one partition per leaf.
Column store tables
The orders_CS table will contain columns such as order_ID, customer_ID, order_date, order_status and total_price. The line_Items table will contain columns such as order_ID, lineitem, product_ID, quantity, price, and suppkey (supply key).
For the orders_CS table, we will use order_id as the shard key. Since order_id is unique, we can use this column to distribute data evenly across partitions. For the line_Items table, we will also use the same shard key (order_id). This way, rows in the orders_CS and line_items tables that have the same order_id will be saved in the same partitions. This will result in significantly faster query results when we are doing joins by making them local, also known as "collocated."
CREATE TABLE orders_CS (order_id bigint(11) NOT NULL,customer_id int(11) NOT NULL,order_date date NOT NULL,order_status char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,SHARD KEY (order_id),SORT KEY (order_date));CREATE TABLE line_items (order_id bigint(11) NOT NULL,lineitem int(11) NOT NULL,product_id int(11) NOT NULL,quantity decimal(15,2) NOT NULL,price decimal(15,2) NOT NULL,suppkey int(11) NOT NULL,SHARD KEY (order_id));
Hash index
Let's create a hash index on the customer_id column in the orders_CS table and on the order_id column in the line_items table. With this hash index you can quickly find the exact match of the data by examining only the segments identified by the hash function, instead of scanning the full tables.
CREATE INDEX customer_id_index ON orders(customer_id) USING HASH;CREATE INDEX order_id_index ON line_items(order_id) USING HASH;
To take advantage of the hash index, the query must have an equality filter for every column in the index. Let’s look at the following join:
SELECT o.order_id, o.order_date, o.total_price, l.product_id, l.quantity, l.priceFROM orders_CS oJOIN line_items l ON o.order_id = l.order_idWHERE o.customer_id = 8530495;
To compare the query performance with a hash index, we first ran the preceding query on the S-0 workspace without creating a hash index on column_id and order_id. There were 272.4M rows in the line_items table and 150M rows in the orders table. The query resulted in 47 rows and took 688 ms to execute.
The majority of the time (389 ms out of 688 ms) was consumed by the Line_items column store filter, which had to perform a full table scan of 272.4 million rows, filter 4.2K rows,and join those 4.2k rows with the 24 rows from the orders_CS table. The column filter on the orders_CS table also consumed 108 milliseconds to filter down 24 rows from a total of 150 million rows. See appendix A for a visual profile.
Now, let's look at the same query with a hash index on column_id and order_id. The size of the tables remained the same and we still ran the query on the S-0 workspace. This time, the query ran in six milliseconds (a 100x improvement). The equality predicate took advantage of the hash index on orders_CS.customer_id.
Therefore, instead of performing a full table scan, the query engine was able to look at only 22.5 million rows (vs 150M total rows) and filter these relevant rows in one millisecond. Similarly, the query engine used the line_items.order_id index to filter and perform a hash join within three milliseconds. See the appendix B for a visual profile of this query using hash indexes.
Column group index
Some of our customers have over 300 column tables. Seeking specific rows from hundreds of columns can be expensive. Let's assume we added multiple columns to our existing orders table. You can create a column group index on all columns as follows:
ALTER TABLE orders_CS ADD Column (orderpriority char(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,shippriority int(11) NOT NULL,comment varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,-- ...);ALTER TABLE orders_CSADD COLUMN GROUP orders_cg_index (*);
Now let's consider querying all the columns in the orders_CS table based on a specific product order ID.
SELECT * FROM orders_CS WHERE order_id = 8530495;
The query engine uses both columnstore and column group indexing efficiently to execute this query. First, it uses the filter to find all the products with order ID 8530495. Using columnstore and the associated hash index on the order_id column is still the most efficient way to filter on a single column.
After filtering the order ID column, the query engine leverages the column group index to read only the necessary full rows, maximizing I/O efficiency. Column group scanning can also be used to speed up updates, where the entire row must be read and the new value is promoted to the in-memory rowstore segment.
Rowstore tables
Now, if you need the best possible performance for your OLTP workloads, even if it means paying a bit more for servers with enough RAM, then in-memory row store tables are the way to go. Let’s create orders and line items tables again, this time as row-store tables.
-- Creating orders tableCREATE ROWSTORE TABLE orders_RS (order_id bigint(11) NOT NULL,customer_id int(11) NOT NULL,order_date date NOT NULL,order_status char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,total_price decimal(15,2) NOT NULL,SHARD KEY order_id (order_id));-- Creating line_items tableCREATE ROWSTORE TABLE line_items_RS (order_id bigint(11) NOT NULL,lineitem int(11) NOT NULL,product_id int(11) NOT NULL,quantity decimal(15,2) NOT NULL,price decimal(15,2) NOT NULL,suppkey int(11) NOT NULL,SHARD KEY __SHARDKEY (order_id),SORT KEY __UNORDERED ());
Skiplist index
Skiplists are indexes primarily used for rowstore tables. They improve query performance by sorting rows into smaller and smaller ordered lists, allowing for fast data lookups and range filters through search that takes time proportional to log(N) for N indexed rows. Therefore to improve the query performance, we created skiplist indexes on the customer_id
column in the orders_RS
table and on the order_id
column in the line_items_RS
table. The syntax for creating skiplist indexes is:
CREATE INDEX customer_id_index ON orders_RS(customer_Id);CREATE INDEX OrderId_Index ON line_items_RS(order_Id);
Lets run the below query with the skiplist indexes. In this example we are running the tests on an S-0 workspace with 1M rows for orders_RS table and 2M rows for line_items_RS table.
Sample query joining orders_RW and line_items_RS tables and filtering on customer_ID:
SELECT o.order_id, o.order_date, o.total_price,o.customer_id, l.product_id, l.quantity, l.priceFROM orders_RW oJOIN line_items_RW l ON o.order_id = l.order_idWHERE o.customer_id = 7345735;
This query completed in one millisecond (4x faster than without the index). Analyzing the query profile (see appendix C for screenshot), the index range scan on the orders_RS table took less than one millisecond. This is because the query optimizer automatically selected a NestedLoopJoin, and combined with the indexes we created, the query engine only had to scan relevant rows with matching order_ids to find all the matching rows.
Conclusion
To recap, we have looked at key concepts for building faster apps with SingleStoreDB. For example, we have looked at how SingleStoreDB shards tables for higher performance and better resource utilization, the unique value of Universal Storage and rowstore tables and various indexes that can improve your query performance.
You can get started with SingleStoreDB completely free. Then, you can follow this simple example to create a database and load data using Pipelines. Once you have created a database, you can use this blog to learn how to create appropriate shard, sort and index keys and test the performance of the queries yourself. In part 2 of this sequence (coming soon), we will learn about SingleStoreDB's transactional and fault-tolerance capabilities.
To further explore the features and benefits of SingleStoreDB, we invite you to check out the SingleStore Training page which includes self-paced courses like schema design, data ingestion, creating indexes, optimizing queries and more.
Appendix
Appendix A: Visual Profile for running the below query without a hash Index. The query execution time was 688ms.
Query:
SELECT o.order_id, o.order_date, o.total_price, l.product_id, l.quantity, l.priceFROM orders_CS oJOIN line_items l ON o.order_id = l.order_idWHERE o.customer_id = 8530495;
Appendix B: The same query (as in appendix A) was executed but now with a Hash index on column_id and order_id. This query was executed in 6ms (100x better query performance).
Query:
SELECT o.order_id, o.order_date, o.total_price, l.product_id, l.quantity, l.priceFROM orders_CS oJOIN line_items l ON o.order_id = l.order_idWHERE o.customer_id = 8530495;
Appendix C : Rowstore Tables
Visual Profile for running the below query with a skiplist Index. The query execution time was 1 ms.
SELECT o.order_id, o.order_date, o.total_price,o.customer_id, l.product_id, l.quantity, l.priceFROM orders_RW oJOIN line_items_RW l ON o.order_id = l.order_idWHERE o.customer_id = 7345735;