Everybody has heard of the KISS principle -- Keep It Simple, Stupid. When applied to data management technology, it implies that you should use a simple solution when it works. Developers often believe that means using a single-node DBMS like MySQL, PostgreSQL, or SQL Server. So, what is a “single-node” database? Essentially, it’s a database designed for a single machine. The capacity of a single machine dictates the resource constraints of processing power, connections, and storage. If you need more processing power or storage, you can vertically scale, meaning you upgrade to a more powerful machine. This can work, up to the scale limits of the largest available machine. By contrast, scale-out databases are built as distributed databases from the start, that is, designed to be run across machines.
Sometimes people see scale-out relational database technology, like SingleStore, as strictly for high-end applications, beyond what one single-node DBMSs can handle. The thinking goes that they are inherently more complex to use than a single-node DBMS.
I'm here to tell you that the KISS principle is right. But people who think it means you should only use scale-out for "extreme" apps are dead wrong.
Let's look at the arguments for why scale out is not "simple," and address them one by one. We'll see that in many cases, scale out actually makes things simpler.
It's hard to get and set up and manage the hardware and software. First, database platform-as-a-service (PaaS) offerings like Singlestore Helios handle all that for you. You can choose a size and dial up and down whenever you need to. For self-hosting, using the public cloud or a well-organized enterprise data center with a range of hardware SKUs means you can pick out and provision machines fairly easily.
My single-node database is fast enough. For some problems with small data, a single-node DBMS may be fast enough. But there are thousands of applications out there with medium-to-large data on a single node system. The people who built them may think they are fast enough, but what if they could run queries instantaneously? Research shows that response time under ¼ second feels instantaneous, and that generates incredible user satisfaction. This drives users to explore more freely, learning more about the data, which helps them make better decisions. Your single-node DBMS might be able to provide near-instant responses for a few users, but what if there are many users?
Enterprises are pursuing digital transformation initiatives to get more out of the data they have, not just scale to handle bigger data sets. The levels of speed and concurrency you can get from scale-out enable new applications that unlock data's value, enabling digital transformation.
If my problem gets big, I can just add more nodes of my regular RDBMS. A common pattern for scaling applications is to create multiple databases on multiple nodes using a single-node DBMS. This could be done in a number of ways, such as\(a) putting each customer's data in a different database with the same schema, and spreading those databases across multiple nodes or\(b) creating replicas of the same database to scale out the workload.\Either way, this is anything but simple because you have to decide at the application level how to split up your data. Moreover, there might be a situation where you need more than one node to handle the workload for a single database. At that point, your single-node database runs out of steam.
My problem is not big enough to benefit from scale out. You'd be surprised about how small an application can be and still benefit from scale out. Here are a couple of examples. First, SingleStore has a customer with a few billion rows of data in a data mart that would easily fit on a single-node database. But they are extending quarter-second response time for dashboard refreshes to several hundred concurrent users, with intra-day updates, enabling a complete digital transformation in how the data is consumed. It's sort of a real-time data mart.
As a second example, we have customers that have less than a million rows of data but are using brute-force searches of vector data for AI image-matching applications using DOT_PRODUCT and EUCLIDEAN_DISTANCE functions in SQL. This brute force approach gives them better match fidelity than multi-dimensional vector indexing that's not available in SQL DBMSs to date, and still let's them integrate their match queries with other SQL query constructs. And see the later discussion about using brute-force scale out to simplify away the need for complex solutions like pre-calculated aggregate tables. Plenty of people with only a few hundred thousand rows of data can benefit from that.
It's hard to design my scale-out database to get it to perform. Yes, there are a couple of new concepts to learn with a scale-out database, mainly sharding (for partitioning data across nodes) and reference tables (for duplicating small dimension tables onto each node). But the horsepower you get from a good, high-performance scale-out database actually simplifies things. E.g. you may need materialized views or pre-calculated summary aggregate tables with a single-node database but not with a scale-out database. Pre-calculated aggregates and materialized views are tricky to use and introduce design problems that are conceptually harder than deciding how to shard your data. If you know when to use an index, you can learn how to shard your data and use reference tables in a few minutes. And you don't have to get it right the first time; it's easy to reconfigure (create a new table with the configuration you want, INSERT...SELECT... data into it, drop the old one, rename the new one, and you're done).
I can't find people with the skills to use a scale-out DBMS. Modern SQL-based scale-out databases are based on standard SQL, and are often compatible with MySQL or Postgres. SingleStore is largely compatible with MySQL, for example. So hundreds of languages and tools can connect to these SQL-based scale-out databases. And almost all of the SQL skills people have from working with SQL databases like MySQL and Postgres are transferable.
I want to use a general-purpose database and there are no general-purpose databases with scale out that work for me. A general purpose tool simplifies life, that's true, by making skill sets applicable to multiple problems, and reducing the effort to search for the right tool. Fortunately, there is a general-purpose SQL database that scales out and runs anywhere. I think you know what database that is.
I could go on, but you get the idea -- the spartan simplicity of relational databases and SQL carries its benefits over to scale-out SQL systems. And scale out simplifies lots of things, and enables digital transformation opportunities that can be valuable to your business and your career.
There's a corollary to the KISS principle that applies here also, often attributed to Albert Einstein: "Everything should be made as simple as possible, but no simpler." In this context, that means you shouldn't give up on really valuable application innovations made possible by the performance you can get from scale out, due to perceived complexity.
Finally, scale out is to improve speed and scalability. SingleStore scales out, but it also has other important technologies to make things faster, including in-memory row store tables, columnstores, compilation of queries to machine code, vectorization, and a high-performance plan cache. All of these things squeeze the most out of each processor core in your system.
So, next time you are about to reach for your trusty single-node DBMS, ask yourself, can I reach higher, and do more, and keep it simple at the same time?