Hi -
I’m using MemSQL 6.7.1. One of the things we’ve noticed is that, under certain circumstances, a transaction can be partially committed and not properly rolled back. This happens when:
- We insert multiple records in a single transaction
- The records are distributed across multiple partitions / leaf nodes
- One of the leaf nodes loses its connection / fails
The error we would see looks like this:
Lost connection to a leaf while committing transactions in the cluster. This transaction may be partially committed. Run SHOW WARNINGS for more details
With a little bit of setup, you can reproduce this locally using scripts that simulate a lost server together with a network connection failure. The SQL commands end up executing like so:
memsql> start transaction;
Query OK, 0 rows affected (0.00 sec)
memsql> insert into facts_test_metrics values
-> (1, 1, 1, 1547406001, 1),
-> (2, 2, 2, 1547406002, 1),
-> (3, 3, 3, 1547406003, 1),
-> (4, 4, 4, 1547406004, 1),
-> (5, 5, 5, 1547406005, 1),
-> (6, 6, 6, 1547406006, 1),
-> (7, 7, 7, 1547406007, 1),
-> (8, 8, 8, 1547406008, 1),
-> (9, 9, 9, 1547406009, 1),
-> (10, 10, 10, 1547406010, 1),
-> (11, 11, 11, 1547406011, 1),
-> (12, 12, 12, 1547406012, 1),
-> (13, 13, 13, 1547406013, 1),
-> (14, 14, 14, 1547406014, 1),
-> (15, 15, 15, 1547406015, 1),
-> (16, 16, 16, 1547406016, 1),
-> (17, 17, 17, 1547406017, 1),
-> (18, 18, 18, 1547406018, 1),
-> (19, 19, 19, 1547406019, 1),
-> (20, 20, 20, 1547406020, 1),
-> (21, 21, 21, 1547406021, 1),
-> (22, 22, 22, 1547406022, 1),
-> (23, 23, 23, 1547406023, 1),
-> (24, 24, 24, 1547406024, 1),
-> (25, 25, 25, 1547406025, 1),
-> (26, 26, 26, 1547406026, 1),
-> (27, 27, 27, 1547406027, 1),
-> (28, 28, 28, 1547406028, 1),
-> (29, 29, 29, 1547406029, 1),
-> (30, 30, 30, 1547406030, 1),
-> (31, 31, 31, 1547406031, 1),
-> (32, 32, 32, 1547406032, 1),
-> (33, 33, 33, 1547406033, 1),
-> (34, 34, 34, 1547406034, 1),
-> (35, 35, 35, 1547406035, 1),
-> (36, 36, 36, 1547406036, 1),
-> (37, 37, 37, 1547406037, 1),
-> (38, 38, 38, 1547406038, 1),
-> (39, 39, 39, 1547406039, 1),
-> (40, 40, 40, 1547406040, 1);
Query OK, 40 rows affected (2.01 sec)
Records: 40 Duplicates: 0 Warnings: 0
memsql> commit;
ERROR 1856 (HY000): Lost connection to a leaf while committing transactions in the cluster. This transaction may be partially committed. Run SHOW WARNINGS for more details.
memsql> show warnings;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error | 2004 | Leaf Error (127.0.0.1:3313): Error reading from socket. Errno=104 (Connection reset by peer) |
| Error | 2004 | Leaf Error (127.0.0.1:3313): Error reading from socket. Errno=104 (Connection reset by peer) |
| Error | 2004 | Leaf Error (127.0.0.1:3313): Error reading from socket. Errno=104 (Connection reset by peer) |
| Error | 1856 | Lost connection to a leaf while committing transactions in the cluster. This transaction may be partially committed. Run SHOW WARNINGS for more details. |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
memsql> select count(1) from facts_test_metrics;
ERROR 1777 (HY000): Partition reporting:4 has no master instance.
memsql> rollback;
Query OK, 0 rows affected (0.00 sec)
memsql> select count(1) from facts_test_metrics;
+----------+
| count(1) |
+----------+
| 37 |
+----------+
We’re wondering if there’s anything that can be done to recover in a scenario like this. The only things we can think of are:
- Stop inserting multiple records in a single transaction (effectively stop using transactions).
- Stop inserting records across multiple partitions in a single transaction (requires knowledge of how records get distributed)
- Delete existing records that match some custom identifier before retrying the transaction (if we detect a partial-commit error from MemSQL)
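To make the third option concrete, here’s a rough sketch of what we have in mind, assuming each row carries a batch identifier we can delete by (the trailing `1` in the inserts above plays that role here) and a DB-API style connection. The `batch_id` column and the `insert_with_retry` helper are purely illustrative, not anything MemSQL provides:

```python
# Sketch of option 3: make the batch idempotent by deleting any rows left
# behind by a failed attempt before retrying the whole transaction.
# 1856 is the "transaction may be partially committed" error we see above.
PARTIAL_COMMIT_ERRNO = 1856

def insert_with_retry(conn, rows, batch_id, max_attempts=3):
    """Insert rows transactionally; on a partial commit, clean up and retry.

    Returns the attempt number that succeeded, or re-raises the last error.
    """
    for attempt in range(1, max_attempts + 1):
        cur = conn.cursor()
        try:
            cur.execute("START TRANSACTION")
            cur.executemany(
                "INSERT INTO facts_test_metrics VALUES (%s, %s, %s, %s, %s)",
                rows)
            cur.execute("COMMIT")
            return attempt
        except Exception as e:
            errno = getattr(e, "args", (None,))[0]
            if errno != PARTIAL_COMMIT_ERRNO or attempt == max_attempts:
                raise
            # Partial commit: roll back our session, then delete whatever
            # subset of the batch did land before retrying from scratch.
            cur.execute("ROLLBACK")
            cur.execute(
                "DELETE FROM facts_test_metrics WHERE batch_id = %s",
                (batch_id,))
```

The obvious downside is that the `DELETE` itself can fail while partitions are still recovering (as the `count(1)` above shows, queries error out until the partition has a master again), so this would need its own retry loop around the cleanup.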
Any other thoughts? Has this scenario been improved in newer versions of MemSQL? Any better way to approach this?
Thanks for any advice!