Since MemSQL doesn’t support the UUID data type, what data type is best suited for UUID values?
I guess either BINARY(16) or VARBINARY(32)
Just like CockroachDB and Cloud Spanner, MemSQL will benefit from a random UUID rather than an optimized COMB version, right?
The image below shows a benchmark for a single-node database. Apparently, binary is only best for smaller data sets, so this makes me think a pure UUID version 4 stored in a VARBINARY(32) column is the best choice for many MemSQL use cases.
What is the metric being measured on the chart you shared? Can you share more details of your benchmark? In line with your findings, I would expect VARBINARY or BINARY type to be best for storing UUID data.
CockroachDB and Cloud Spanner distribute data via range sharding. This means that when using an optimized COMB UUID, writes/reads of recent data would only hit a couple of nodes. This is why random UUIDs work better there, since they prevent a specific node from getting overloaded. MemSQL distributes data via hash sharding and so doesn't have this issue. As a result, from a data distribution perspective, random UUID and COMB should be relatively equivalent.
From a storage perspective, I would expect COMB to be better in MemSQL since you will likely be indexing the UUID column in row store or sorting by it or a time dimension in columnstore. Since data would be arriving in a naturally sorted order, this would make it more efficient.
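To make that concrete, here is a minimal sketch (made-up table and column names, not a recommendation for any specific schema):

-- Hypothetical rowstore table: a COMB UUID arrives in roughly sorted order,
-- which keeps maintenance of the index on `uuid` cheap.
CREATE TABLE events_rowstore (
    uuid BINARY(16) NOT NULL,
    created_at DATETIME NOT NULL,
    payload TEXT,
    PRIMARY KEY (uuid)
);

-- Hypothetical columnstore table: sorted by a time dimension, so time-ordered
-- (COMB) inserts line up with the clustered columnstore sort key.
CREATE TABLE events_columnstore (
    uuid BINARY(16) NOT NULL,
    created_at DATETIME NOT NULL,
    payload TEXT,
    KEY (created_at) USING CLUSTERED COLUMNSTORE
);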
Thanks @rob, the difference between CockroachDB/Cloud Spanner vs MemSQL is really useful. I wasn’t aware of that.
Sorry for not providing the source of the benchmark visualization. I also should have mentioned that the benchmark was performed by Spatie on a MySQL server. Source: spatie/laravel-binary-uuid
I guess I have to perform a benchmark on MemSQL to see how BINARY and VARBINARY compare to each other, both for a traditional random UUID and a COMB version.
I really like how PostgreSQL simplifies working with UUIDs with its special UUID data type. It removes the need for converting between bytes and HEX strings in the application. The CONVERT function will probably make it a little easier. Any chance MemSQL will support UUID natively in the near future?
UUID type is something we would like to do but haven’t committed to building yet.
Since UUIDs are fixed length, I think BINARY(16) would actually be better. This stackoverflow answer, I think, gives a pretty good strategy for inserting/querying data that would also be applicable to MemSQL.
If you do ultimately do some tests of your own, let us know how they go.
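For reference, the basic pattern from that answer (a rough sketch with a placeholder table name) is to UNHEX on the way in and HEX on the way out:

-- Insert: strip the dashes and convert the 32 hex chars to 16 raw bytes.
INSERT INTO my_table (uuid, text)
VALUES (UNHEX(REPLACE('3f06af63-a93c-11e4-9797-00505690773f', '-', '')), 'example row');

-- Lookup: apply the same conversion in the WHERE clause so the key on `uuid` is used.
SELECT HEX(uuid), text
FROM my_table
WHERE uuid = UNHEX(REPLACE('3f06af63-a93c-11e4-9797-00505690773f', '-', ''));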
Binary UUID: BINARY(16) with regular UUID version 4
Ordered UUID: BINARY(16) with ordered UUID version 1
Comb UUID: BINARY(16) with comb UUID version 4
Textual UUID: CHAR(36) with regular UUID version 4
Each run selects a random UUID from the dataset 10,000 times using the index and calculates the average time.
That’s very interesting that there is such a big difference in performance between the different types only in the 50k row dataset.
Could you share more details for the test setup? What was the hardware? Were the 10,000 queries being run concurrently or sequentially? What was the schema used for the test?
I found it very interesting as well. Perhaps I missed something. Maybe I should perform an OPTIMIZE TABLE ... FLUSH between seeding the table and reading?
I ran the test on a single VM with 4 vCPUs + 15 GB ram at GCP using cluster-in-a-box mode.
The benchmark runs sequentially and consists of basically 3 steps.
Create table
$this->connection->exec(<<<'SQL'
    DROP TABLE IF EXISTS `binary_uuid`;
    CREATE TABLE `binary_uuid` (
        `uuid` BINARY(16) NOT NULL,
        `text` TEXT NOT NULL,
        KEY (`uuid`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
SQL
);
For your benchmark, it looks like you are using rowstore, so an OPTIMIZE TABLE ... FLUSH would not change things. One thing I would recommend changing in your test would be to make the uuid column the primary key of the table or at least the shard key.
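Roughly like this, reusing your table definition (alternatively keep the plain KEY and add an explicit SHARD KEY (uuid)):

CREATE TABLE `binary_uuid` (
    `uuid` BINARY(16) NOT NULL,
    `text` TEXT NOT NULL,
    -- with no explicit SHARD KEY, the primary key also acts as the shard key
    PRIMARY KEY (`uuid`)
);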
We decided to go with char(32) to avoid hex and unhex operations, so we never re-ran the benchmarks. Performance is good enough for us.
Integers are indeed fastest, but currently only 64-bit integers are supported. UUIDs use 128 bits.
To get this speedup, join fact tables to dimension tables on integer columns and use group-by and aggregate operations in your queries. Using a star or snowflake schema with integer surrogate keys is a best practice and has been taught for years as part of the Kimball methodology, the most popular schema approach for data warehouses, data marts, and business intelligence systems.
If you have an analytical application and are joining on integer keys then grouping and aggregating, you may see stunning speed improvements, over 100X in some cases.
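As a generic illustration (hypothetical fact and dimension tables), the kind of query that benefits looks like this:

-- The fact table joins to a dimension table on an integer surrogate key,
-- then groups and aggregates.
SELECT d.region, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_customer d ON f.customer_key = d.customer_key
GROUP BY d.region;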
In case other developers come by this topic, I'd like to share another thing to consider based on our experience.
We decided to go with char(32) to avoid hexing/unhexing. Performance didn’t seem to be impacted by the data type compared to binary(16).
One thing we didn’t consider was the space consumed by the column. char(32) consumes 96 bytes while binary(16) only consumes 16 bytes (or 6 times less). Source: SingleStoreDB Cloud · SingleStore Documentation
One of our tables has grown to 50M rows, causing the ID column to consume 4.5 GB of memory. With binary(16), the same column would only have consumed 0.75 GB.
Based on this experience, we're considering migrating to binary(16) to save some space, but also because we expect performance to improve as the data set grows. In particular, JOINs are expected to improve with binary(16).
We’ll likely do a benchmark soon and post the results here.
We did a very simple test comparing different data types and found that queries joining on binary or integer columns are significantly faster than on the char data type.
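Roughly, the comparison boils down to timing the same join shape against tables keyed on each type (made-up table names below, not our exact script):

-- Join keyed on BINARY(16) (and, separately, BIGINT)
SELECT COUNT(*) FROM orders_bin o JOIN customers_bin c ON o.customer_uuid = c.uuid;

-- Same join keyed on CHAR(32)
SELECT COUNT(*) FROM orders_char o JOIN customers_char c ON o.customer_uuid = c.uuid;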
My previously posted benchmark is definitely misleading based on our new knowledge. The benchmark was based on someone else's benchmark script and didn't utilize indexes.
With indexes, the three data types perform equally in simple index lookup cases.
If UUIDs are a requirement, like in our application, I would definitely go with binary(16) over char(32). This will both improve performance and reduce space consumption. Otherwise, use integers whenever possible.
I noticed that there has been some increased interest regarding table data structures / keys, especially for columnstore.
We do have a document with guidelines on optimizing keys and data structures, but it seems it does not satisfy everyone's technical needs. We are curious what you'd like to see improved in it, what's missing, etc., so we can better serve our customers.
@Jacky I actually find the mentioned document pretty informative already.
However, I do share the same question as @bvincent in his post:
We do aggregate a large amount of data with SUM, PERCENTILE_CONT, etc. and have always wondered if it would benefit to put these columns in the clustered key as well. Segment elimination is already working very well and execution times are great overall, so I doubt the aggregation columns would benefit further from being sorted into logical segments, since the aggregating part has to scan all filtered rows anyway?
If anything should be improved, perhaps an additional paragraph in the filtering part. The examples state that the columns in the clustered key should be the ones you're filtering on (where clause), but I found that columns in a group by or join clause also benefit hugely from being included in the clustered key, as do the order by columns, though that is already mentioned a little. I guess the joining columns might benefit from a hash index in MemSQL 7.0 rather than being included in the clustered key?
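To make that concrete, a sketch of what I mean (hypothetical table, not from our actual schema):

CREATE TABLE sales (
    sold_at DATETIME NOT NULL,
    region_id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    amount DECIMAL(18,2) NOT NULL,
    -- filter + group by / order by columns in the clustered columnstore key
    KEY (sold_at, region_id) USING CLUSTERED COLUMNSTORE,
    -- join column as a secondary hash index instead (MemSQL 7.0+)
    KEY (customer_id) USING HASH
);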
I don't see any storage improvement from using VARBINARY instead of VARCHAR for UUID.
By the way, VARCHAR should take 1 byte per character or 2 bytes… E.g., suppose in CREATE TABLE I have declared varchar(130) and inserted the data 'qwer1y1iopasdfghjklz3c3bwmqazcsweqcr'. I observed that it is taking 74 bytes in the system. Any explanation for these 74 bytes?
@RobMem if you use BINARY(16) you will save space compared with VARCHAR(36) storing 36-char GUIDs in string format with dashes. You have to convert the data first, of course, as described in:
Naive/trusting implementation of UUID helper functions - you need one to convert to binary and another to get a "human readable" UUID - although it never really is human readable:
CREATE FUNCTION UUID_ENCODE(UUID varchar(36) CHARACTER SET utf8 COLLATE utf8_general_ci NULL) RETURNS binary(16) NULL AS
BEGIN
    RETURN UNHEX(REPLACE(UUID, '-', ''));
END;

CREATE FUNCTION UUID_DECODE(A binary(16) NULL) RETURNS varchar(36) CHARACTER SET utf8 COLLATE utf8_general_ci NULL AS
DECLARE result VARCHAR(36) = '';
BEGIN
    result = HEX(A);
    -- we need to insert a dash in four places
    RETURN LOWER(CONCAT_WS('-', SUBSTR(result, 1, 8), SUBSTR(result, 9, 4), SUBSTR(result, 13, 4), SUBSTR(result, 17, 4), SUBSTR(result, 21, 12)));
END;
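Used together they would look something like this (placeholder table name):

-- write the 36-char string as 16 raw bytes, read it back in the familiar dashed form
INSERT INTO my_table (uuid) VALUES (UUID_ENCODE('3f06af63-a93c-11e4-9797-00505690773f'));

SELECT UUID_DECODE(uuid)
FROM my_table
WHERE uuid = UUID_ENCODE('3f06af63-a93c-11e4-9797-00505690773f');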
I wonder how difficult it would be to implement this inside the DB as a standard function.