The Winter 2022 Release of SingleStoreDB added performance improvements for transactional (OLTP) and analytical (OLAP) workloads — particularly, for working with JSON and string data.
This article will present several examples to demonstrate the speedup, so you can experience the performance improvements first hand.
Introduction
It’s no secret JSON is a wildly popular data format, used by developers and engineers all over. We’ve previously covered how SingleStoreDB supports JSON — with the latest innovation in 8.0 for seekable JSON. This enables JSON data to be stored in a compressed format, and when specific rows need to be retrieved, only the retrieved data needs to be decompressed. Figure 1 shows the previous approach for accessing JSON data (left-hand side) and the new approach (right-hand side). The new approach makes JSON suitable for both OLTP and OLAP applications, further cementing SingleStoreDB as the ideal, unified database.
For developers working extensively with JSON, that means there are no changes in data format or JSON Functions, as enhancements are under the covers. What developers will notice, however, are the performance gains. For management, the improved performance means SingleStoreDB becomes a more compelling business solution, removing the need to integrate multiple products or databases, leading to better Total Cost of Ownership (TCO).
Seekable JSON in Action
On SingleStore's YouTube Channel, there is a great 5-minute video demonstrating “Faster Seeks for JSON Data on Universal Storage” — and the demo results are impressive. Interested in trying it for yourself? Fortunately, all the details are provided in the SingleStore documentation, and you can find the demo code in the JSON Performance section.
One way we can test the code locally is to install SingleStoreDB in a virtual machine environment (you can find the details for how to do that here). You can also give this a try using the SingleStoreDB Dev Image, which contains all the required software.
Install SingleStoreDB
Let’s install SingleStoreDB in a virtual machine environment, which takes just a few minutes. For this article, we'll only need two tarball files:
Assuming a two-node cluster was correctly deployed and using the same variable names from the virtual machine set up, we can connect to our cluster from a MySQL CLI Client as follows:
mysql --local-infile -u root -h ubuntu -P 3306
--default-auth=mysql_native_password -p
Once connected to our cluster, we need to execute the JSON Performance demo code. The demo code produces over 8 million rows of data.
Results
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = OFF;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.46785009 |
+------------+
+------------+
| avg_time |
+------------+
| 0.49248448 |
+------------+
+------------+
| avg_time |
+------------+
| 0.45003322 |
+------------+
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = ON;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.00581854 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00237616 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00180840 |
+------------+
The results are summarized in Figure 2 using a Log Scale.
Taking the average, these results show we achieved over 140x performance improvement with seekable JSON with this small cluster set-up and environment.
Bonus: Using Los Angeles Police Department Crime Data
How does seekable JSON perform on real-world data? Let’s put it to the test with crime data provided by the Los Angeles Police Department — focusing on crime data from 2020 to present. The data can be downloaded in a CSV format, and is available under a Creative Commons CC0 1.0 Universal license. From the website:
“This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database.”
At the time of writing this article, the downloaded CSV file was approximately 163 MB.
Convert CSV to JSON
First, we need to convert the CSV file to JSON. With some help from ChatGPT, the following Python code achieves this, creating correct JSON records using double quotes, nulls and one JSON record per line:
import pandas as pd
import json
import ndjson
input_file = "rows.csv"
output_file = "rows.json"
df = pd.read_csv(input_file)
df.rename(columns = lambda x: x.replace(" ", "_"), inplace = True)
json_data = json.loads(df.to_json(orient = "records"))
with open(output_file, "w") as f:
ndjson.dump(json_data, f)
The following shows an example JSON record with the keys displayed and the values removed.
{
"DR_NO":
"Date_Rptd":
"DATE_OCC":
"TIME_OCC":
"AREA":
"AREA_NAME":
"Rpt_Dist_No":
"Part_1-2":
"Crm_Cd":
"Crm_Cd_Desc":
"Mocodes":
"Vict_Age":
"Vict_Sex":
"Vict_Descent":
"Premis_Cd":
"Premis_Desc":
"Weapon_Used_Cd":
"Weapon_Desc":
"Status":
"Status_Desc":
"Crm_Cd_1":
"Crm_Cd_2":
"Crm_Cd_3":
"Crm_Cd_4":
"LOCATION":
"Cross_Street":
"LAT":
"LON":
}
The JSON file was approximately 432 MB in size.
Results
You can find the complete code listed in the Appendix, which follows the structure of the earlier example. In the code in the Appendix, when loading the data into the database, we would replace /path/to/ with the actual path to where the JSON file was located. The JSON data load creates nearly 640,000 rows.
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = OFF;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.17954817 |
+------------+
+------------+
| avg_time |
+------------+
| 0.19996782 |
+------------+
+------------+
| avg_time |
+------------+
| 0.17837315 |
+------------+
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = ON;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.00427090 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00249579 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00249940 |
+------------+
The results are summarized in Figure 3 using a Log Scale.
Taking the average, these results show we achieved over 60x performance improvement with seekable JSON with this small cluster set-up, small dataset and environment.
Summary
Our simple performance tests demonstrate that seekable JSON can provide significant speedup. With mixed workloads consisting of both OLTP and OLAP, SingleStoreDB now supports fast access to JSON data in multiple environments, making it the product of choice for many applications.
Appendix
SET GLOBAL use_seekable_json = OFF;
CREATE DATABASE crimes;
USE crimes;
CREATE TABLE crimes (
id INT UNSIGNED PRIMARY KEY NOT NULL AUTO_INCREMENT,
crime_doc JSON NOT NULL,
SORT KEY(id)
);
LOAD DATA LOCAL INFILE '/path/to/rows.json'
INTO TABLE crimes
FORMAT JSON (
crime_doc <- %
);
DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
_id INT;
_od json;
_oid INT;
m INT;
st DATETIME(6);
et DATETIME(6);
BEGIN
SELECT MAX(id) INTO m FROM crimes;
st = now(6);
FOR i in 1..n LOOP
_oid = CEILING(m*RAND());
SELECT id, crime_doc
INTO _id, _od
FROM crimes
WHERE id = _oid;
END LOOP;
et = now(6);
echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n
AS avg_time;
END
//
DELIMITER ;
OPTIMIZE TABLE crimes FULL;
CALL get_avg_time(100);
SET GLOBAL use_seekable_json = ON;
CREATE TABLE crimes2 LIKE crimes;
INSERT INTO crimes2 SELECT * FROM crimes;
OPTIMIZE TABLE crimes2 FULL;
DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
_id INT;
_od json;
_oid INT;
m INT;
st DATETIME(6);
et DATETIME(6);
BEGIN
SELECT MAX(id) INTO m FROM crimes2;
st = now(6);
FOR i in 1..n LOOP
_oid = CEILING(m*RAND());
SELECT id, crime_doc
INTO _id, _od
FROM crimes2
WHERE id = _oid;
END LOOP;
et = now(6);
echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n
AS avg_time;
END
//
DELIMITER ;
CALL get_avg_time(100);