This post originally appeared on the Novus Tech Blog
More and more, companies are migrating from NoSQL databases back to relational, ACID-compliant databases that also offer high availability and horizontal scalability – aka “NewSQL.” Novus Partners is no exception, having recently completed our migration from MongoDB to SingleStore.
Of course, NoSQL developers often come to depend on rapid development cycles and “schema-less” data models, which can make the transition back to relational – including the prospect of devising a relational model for your schemaless data – daunting. It turns out, however, that SingleStore offers a feature that enables you to keep many of those benefits and ease the NoSQL-to-NewSQL transition: the JSON column type.
In this article, we’ll get up and running quickly with SingleStore, and then immediately learn about its native JSON support and what it can do.
Getting Started with SingleStore
Although SingleStore is a proprietary technology (binary download only), they recently released a free Community Edition that is perfect for developer testing. All you need is a 64-bit Linux environment and the MySQL client (SingleStore made a strategic decision to implement the MySQL client protocol “bug-for-bug”).
After downloading and installing the Community Edition and the MySQL client, you will want to alias memsql to the MySQL client command as below (I have this alias in my ~/.local.bash):
$ alias memsql
alias memsql='mysql -u root -h 127.0.0.1 -P 3306 --prompt="memsql> "'
$ memsql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 15
Server version: 5.5.8 SingleStore source distribution (compatible; MySQL Enterprise & MySQL Commercial)
Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
If you already know MySQL, you already know most of SingleStore’s commands and syntax. SingleStore adds some proprietary commands and syntax, mostly related to its replication and sharding functionality, which we won’t discuss here. If you want to learn more, you can find all of their documentation online.
Let’s start out by inspecting the default databases, and then define our own “test” database where we can explore SingleStore’s JSON support:
memsql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| memsql |
| sharding |
+--------------------+
3 rows in set (0.00 sec)
memsql> create database test;
Query OK, 1 row affected (2.21 sec)
memsql> use test;
Database changed
Prototyping a Schema
Perhaps in our existing NoSQL database, we had a document collection named tasks that contained, among other things, unstructured data related to our distributed compute application. Our developers added and removed fields from their Task objects as needed for different compute jobs, which was great from a development perspective because it meant rapid development cycles without the need for frequent database schema changes and migrations.
Fortunately, we don’t have to leave that agility behind when we transition from NoSQL back to SingleStore’s relational model.
The simplest version of our new tasks table has two columns: a bigint primary key and a column for JSON data, which has the type… JSON:
memsql> create table tasks ( task_id bigint not null primary key auto_increment, task json not null );
Query OK, 0 rows affected (15.53 sec)
memsql> describe tasks;
+---------+------------+------+------+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+------------+------+------+---------+----------------+
| task_id | bigint(20) | NO | PRI | NULL | auto_increment |
| task | JSON | NO | | NULL | |
+---------+------------+------+------+---------+----------------+
2 rows in set (0.00 sec)
Now, let’s insert some data into this table. The task documents themselves are just JSON literals inside single-quotes, like any other data type:
memsql> insert into tasks (task) values ('{"user" : "nzucker", "uid" :{"clientId" : 1, "which" : "P", "id" : 205} } ');
Query OK, 1 row affected (0.00 sec)
memsql> insert into tasks (task) values ('{"user" : "nzucker", "uid" :{"clientId" : 7, "which" : "P", "id" : 1009} } ');
Query OK, 1 row affected (0.00 sec)
memsql> insert into tasks (task) values ('{"user" : "bqunibi", "uid" :{"clientId" : 9, "which" : "P", "id" : 327} } ');
Query OK, 1 row affected (0.00 sec)
memsql> select * from tasks;
+---------+---------------------------------------------------------------+
| task_id | task |
+---------+---------------------------------------------------------------+
| 2 |{"uid":{"clientId":1,"id":205,"which":"P"} ,"user":"nzucker"} |
| 4 |{"uid":{"clientId":9,"id":327,"which":"P"} ,"user":"bqunibi"} |
| 3 |{"uid":{"clientId":7,"id":1009,"which":"P"} ,"user":"nzucker"} |
+---------+---------------------------------------------------------------+
3 rows in set (0.00 sec)
An important note: SingleStore will not let you insert invalid JSON documents. Attempting to do so yields an error:
memsql> insert into tasks (task) values ('{"user" : ');
ERROR 1844 (HY000): Leaf Error (10.20.79.111:3307): Invalid JSON value for column 'task'
JSON Field Predicates
Suppose we want to select rows based on a specific JSON document field. Even though the SingleStore JSON column type is something like a LONGTEXT column in terms of data format and limitations, the database query engine understands the JSON specification and how to navigate the document tree.
For example, we can select all the tasks that I previously inserted having a user of ‘nzucker’:
memsql> select * from tasks t where t.task::$user = 'nzucker';
+---------+---------------------------------------------------------------+
| task_id | task |
+---------+---------------------------------------------------------------+
| 2 |{"uid":{"clientId":1,"id":205,"which":"P"} ,"user":"nzucker"} |
| 3 |{"uid":{"clientId":7,"id":1009,"which":"P"} ,"user":"nzucker"} |
+---------+---------------------------------------------------------------+
2 rows in set (2.71 sec)
The clause t.task::$user navigates to the user field of the JSON document, with the $ prefix ensuring that the value is evaluated as a STRING data type (it is actually shorthand for the JSON_EXTRACT_STRING function).
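Since the $ shorthand maps to JSON_EXTRACT_STRING, the same query can also be written with the explicit function call, which should be equivalent:
memsql> select * from tasks t where JSON_EXTRACT_STRING(t.task, 'user') = 'nzucker';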
Naturally, SingleStore supports predicates that use nested JSON fields. For example, we can also query by the uid.clientId field:
memsql> select * from tasks t where t.task::uid::%clientId = 7;
+---------+---------------------------------------------------------------+
| task_id | task |
+---------+---------------------------------------------------------------+
| 3 |{"uid":{"clientId":7,"id":1009,"which":"P"} ,"user":"nzucker"} |
+---------+---------------------------------------------------------------+
1 row in set (0.00 sec)
The % prefix on ::%clientId ensures that the field is interpreted as a double (the only numeric data type supported by the JSON standard), which is important if you are doing numeric comparisons in your queries:
memsql> select * from tasks t where t.task::uid::%clientId < 9;
+---------+---------------------------------------------------------------+
| task_id | task |
+---------+---------------------------------------------------------------+
| 3 |{"uid":{"clientId":7,"id":1009,"which":"P"} ,"user":"nzucker"} |
| 2 |{"uid":{"clientId":1,"id":205,"which":"P"} ,"user":"nzucker"} |
+---------+---------------------------------------------------------------+
2 rows in set (0.00 sec)
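By analogy with $ and JSON_EXTRACT_STRING, the % shorthand corresponds to JSON_EXTRACT_DOUBLE, so the same predicate can presumably be expressed by extracting the uid sub-document and then the numeric field:
memsql> select * from tasks t where JSON_EXTRACT_DOUBLE(JSON_EXTRACT_JSON(t.task, 'uid'), 'clientId') < 9;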
You can also select “sub-documents” from JSON columns using a similar syntax:
memsql> select task::uid from tasks;
+--------------------------------------+
| task::uid |
+--------------------------------------+
|{"clientId":7,"id":1009,"which":"P"} |
|{"clientId":1,"id":205,"which":"P"} |
|{"clientId":9,"id":327,"which":"P"} |
+--------------------------------------+
3 rows in set (2.73 sec)
Because we’re extracting JSON documents from our JSON column, no $ or % prefix is required in the task::uid clause.
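The explicit function form of this projection is JSON_EXTRACT_JSON, the same function we will use later to index into an array; it should produce the same result:
memsql> select JSON_EXTRACT_JSON(task, 'uid') from tasks;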
Persisted Columns
SingleStore also supports persisted columns (also known as “computed columns”) whose values are extracted from JSON document fields. This feature is very convenient if you find yourself repeatedly querying the same deeply-nested JSON document fields.
memsql> alter table tasks add column client_id as task::uid::clientId persisted bigint;
Query OK, 0 rows affected (8.90 sec)
Records: 0 Duplicates: 0 Warnings: 0
memsql> select * from tasks;
+---------+---------------------------------------------------------------+-----------+
| task_id | task | client_id |
+---------+---------------------------------------------------------------+-----------+
| 3 |{"uid":{"clientId":7,"id":1009,"which":"P"} ,"user":"nzucker"} | 7 |
| 2 |{"uid":{"clientId":1,"id":205,"which":"P"} ,"user":"nzucker"} | 1 |
| 4 |{"uid":{"clientId":9,"id":327,"which":"P"} ,"user":"bqunibi"} | 9 |
+---------+---------------------------------------------------------------+-----------+
3 rows in set (2.75 sec)
Here we persisted the client_id field, extracted from the task object. Note that if our task documents have a suitable primary key field (say, a field named _id), we could extract that field to populate the task_id column.
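As a minimal sketch of that idea (assuming every document carries a numeric _id field, and that SingleStore accepts a persisted computed column as a primary key; the table name here is hypothetical):
create table tasks_v2 (
  task json not null,
  -- task_id is computed from the document's _id field
  task_id as task::_id persisted bigint,
  primary key (task_id)
);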
Now let’s select using the client_id persisted column:
memsql> select count(*) from tasks where client_id = 9;
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (2.27 sec)
If you plan your persisted columns in advance, you will make life easier for developers or analysts who may not be familiar with the JSON document structure. Also, you can create indexes on these JSON-derived persisted columns, which is a huge performance benefit.
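For example, the secondary index can be declared together with the persisted column at table-creation time. A minimal sketch (the table name is hypothetical, and whether indexes can also be added to existing tables may depend on your SingleStore version):
create table tasks_indexed (
  task_id bigint not null primary key auto_increment,
  task json not null,
  -- JSON-derived persisted column, indexed below
  client_id as task::uid::clientId persisted bigint,
  key (client_id)
);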
Updating JSON Document Fields “In-Place”
If you have used JSON persistence from an object-oriented language, you might have written code like the following:
1. Fetch the entire JSON document out of the store.
2. Deserialize the JSON document into an object.
3. Update a single field on the object.
4. Serialize the entire object back to the store as JSON.
That’s quite a bit of data transfer just to manipulate a single field. Well, since we can select database rows by JSON fields, why not update individual fields of JSON documents as well? This too is possible:
memsql> update tasks set task::uid::$which = 'F' where task_id = 3;
Query OK, 1 row affected (0.00 sec)
memsql> select * from tasks where task_id = 3;
+---------+---------------------------------------------------------------+-----------+
| task_id | task | client_id |
+---------+---------------------------------------------------------------+-----------+
| 3 |{"uid":{"clientId":7,"id":1009,"which":"F"} ,"user":"nzucker"} | 7 |
+---------+---------------------------------------------------------------+-----------+
1 row in set (0.00 sec)
It’s worth mentioning that SingleStore doesn’t support JSON field-level validation, so “in-place” updates such as this run the risk of generating data that violates your domain model. For example, if the task::uid::which field is required, but you set task::uid::$which = NULL, your application may encounter errors due to the missing field. So, use this feature with caution (and perhaps a robust set of integration tests).
Manipulating JSON Arrays
Another great feature of SingleStore’s JSON support is the ability to manipulate JSON arrays in-place, using functions such as JSON_ARRAY_PUSH_STRING.
Let’s continue with our “task” documents example by defining their domain model in Scala as follows:
case class UID(clientId: Int, which: String, id: Int)
case class Task(user: String, uid: UID, history: List[String])
As with any other JSON field, SingleStore array manipulation functions enable us to add entries to a JSON array “in-place.” Using our example data set, we first update one of our documents to have a field called “history,” initialized with an empty JSON array:
memsql> update tasks set task::history = '[]' where task_id = 2;
Query OK, 1 row affected (2.96 sec)
We can then push into the array and observe the results:
memsql> update tasks set task::history = JSON_ARRAY_PUSH_STRING(task::history, "New") where task_id = 2;
Query OK, 1 row affected (0.00 sec)
memsql> select * from tasks;
+---------+--------------------------------------------------------------------------------+-----------+
| task_id | task | client_id |
+---------+--------------------------------------------------------------------------------+-----------+
| 3 |{"uid":{"clientId":7,"id":1009,"which":"F"} ,"user":"nzucker"} | 7 |
| 2 |{"history":["New"],"uid":{"clientId":1,"id":205,"which":"P"} ,"user":"nzucker"} | 1 |
| 4 |{"uid":{"clientId":9,"id":327,"which":"P"} ,"user":"bqunibi"} | 9 |
+---------+--------------------------------------------------------------------------------+-----------+
3 rows in set (0.00 sec)
memsql> update tasks set task::history = JSON_ARRAY_PUSH_STRING(task::history, "InProgress") where task_id = 2;
Query OK, 1 row affected (0.00 sec)
memsql> select * from tasks where task_id = 2;
+---------+---------------------------------------------------------------------------------------------+-----------+
| task_id | task | client_id |
+---------+---------------------------------------------------------------------------------------------+-----------+
| 2 |{"history":["New","InProgress"],"uid":{"clientId":1,"id":205,"which":"P"} ,"user":"nzucker"} | 1 |
+---------+---------------------------------------------------------------------------------------------+-----------+
1 row in set (0.00 sec)
Finding the Last Element of the Array
Perhaps we want to find the most recent update in our task document’s history array. One (inefficient) approach is to extract the entire array, deserialize it into an array object in application code, and find the last element:
memsql> select task::history from tasks where task_id = 2;
+----------------------+
| task::history |
+----------------------+
| ["New","InProgress"] |
+----------------------+
1 row in set (2.18 sec)
This certainly works in a pinch, but a better solution is to obtain the last element – representing the most recent status – directly in the SQL. Thank you to Justin Fu at SingleStore Support, who provided this solution:
memsql> select task_id, JSON_EXTRACT_JSON(task::history, JSON_LENGTH(task::history) - 1) as latest_status
-> from tasks
-> where task_id = 2;
+---------+---------------+
| task_id | latest_status |
+---------+---------------+
| 2 | "InProgress" |
+---------+---------------+
1 row in set (1.24 sec)
Performance Considerations
A quick word about performance: JSON is a plain-text data format and will therefore always be slower than an optimized binary format (see Thrift and Protocol Buffers). At Novus, our migration path was to continue using Salat for Scala case-class-to-JSON serialization, which in turn uses json4s for JSON AST operations. In most cases, this was painless and Just Worked.
However, in some cases we encountered human-perceptible performance degradation after moving from MongoDB to SingleStore. Typically, this occurred when deserializing thousands of JSON documents into Scala objects while processing a SELECT result. Although many developers think of Mongo as “just JavaScript,” it is not: its storage and wire format is BSON, a carefully planned and optimized binary format. Given that our application was now transferring and parsing JSON text rather than BSON, this slow-down was completely understandable.
So the trade-off was to sacrifice human-readability for performance, reverting to a binary format for the performance-sensitive functionality. We successfully used Twitter Chill for this purpose, storing the serialized objects in a BLOB database column. Another option is to bite the bullet and devise a relational model for your object model, particularly if the design is stable.
Either way, be sure to focus on functionality and domain modeling first, before turning to performance optimizations.
Thank you to Carlos Bueno (@archivd) and the rest of the SingleStore team for feedback on earlier drafts of this article.