Is there a way to get a quick checksum based on the data row contents in an entire table.
I’m trying to find a quick way to validate that 2 databases are identical; or at least identify which table(s) have different row content… is there any information_schema table I could query for a checksum of the row contents for all rows in a table ?
I’ve experimented with using crc32( json_agg( to_json( table.* ) ) ) but unfortunately many of our tables are just too large for json_agg .
Would summing the crc32 of the to_json(table.*) of all rows be meaningful ?
Summing crc32 of the to_json(table.*) would be meaningful, yes. Consider something like this:
singlestore> select count(*), sum(crc32(to_json(r.*)) :> bigint) from r;
+----------+------------------------------------+
| count(*) | sum(crc32(to_json(r.*)) :> bigint) |
+----------+------------------------------------+
| 2 | 4358384770 |
+----------+------------------------------------+
1 row in set (0.05 sec)
singlestore> select * from r;
+------+------+
| a | b |
+------+------+
| 1 | 2 |
| 3 | 4 |
+------+------+
2 rows in set (0.06 sec)
singlestore> insert r values(100,200);
Query OK, 1 row affected (0.06 sec)
singlestore> select count(*), sum(crc32(to_json(r.*)) :> bigint) from r;
+----------+------------------------------------+
| count(*) | sum(crc32(to_json(r.*)) :> bigint) |
+----------+------------------------------------+
| 3 | 5782152428 |
+----------+------------------------------------+
1 row in set (0.00 sec)
Then if the count or the crc32 changes, the tables are different. You could do the count first before calculating the CRC32 to save time & only do the CRC32 if the count is equal.
I’m not sure the cast to bigint is necessary. I just did that to be safe to make sure the sum would be a bigint. That might already be happening by default.