I just upgraded my cluster from 7.3.5 to 7.5.6 and it destroyed my cluster.
The precheck was ok, then the upgrade went fine on the first leaf but failed to detach the second leaf.
Now the second leaf can’t start anymore with error:
05970401 2021-07-29 10:35:27.748 FATAL: Thread 115118: jumpToUpgradeStep: This node is not managed by a supported tool. Please use a toolbox version at least as new as 1.11.3. : Failed to connect to MemSQL: process exited: exit status 1
The toolbox version installed is 1.11.9
How can I solve this, please? My cluster is in a weird state where the nodes no longer run the same version.
Thanks for your help
PS: all nodes throw the same error after server reboot, and the cluster can’t start anymore
UPDATE: I found out that the file /var/lib/memsql/XXXX/memsql.cnf contained a toolbox_version = xx property that was outdated and was responsible for the upgrade failure.
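For anyone hitting the same error, here is a minimal Python sketch of that check. It assumes the property is stored as a plain `toolbox_version = x.y.z` line, which is a guess based on the description above; the function names are hypothetical, and the 1.11.3 minimum is taken from the FATAL message.

```python
import re

# Matches a line such as "toolbox_version = 1.9.0". The key name comes from the
# post above; the plain dotted-number value format is an assumption.
_TOOLBOX_RE = re.compile(r"^\s*toolbox_version\s*=\s*(\S+)\s*$", re.MULTILINE)


def find_toolbox_version(cnf_text):
    """Return the toolbox_version recorded in a memsql.cnf body, or None."""
    match = _TOOLBOX_RE.search(cnf_text)
    return match.group(1) if match else None


def version_tuple(version):
    """Turn '1.11.3' into (1, 11, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_outdated(recorded, minimum="1.11.3"):
    """True if the recorded version is older than the minimum from the error."""
    return version_tuple(recorded) < version_tuple(minimum)
```

Read the node's memsql.cnf into a string, pass it to `find_toolbox_version`, and compare the result with `is_outdated`. The tuple comparison matters because a plain string comparison would rank "1.9.0" above "1.11.3".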
The cluster is back online but the leaf that failed to upgrade is using 100% CPU and all the partitions are impacted.
The tracelogs on the failed leaf are streaming millions of errors like:
ERROR: Replication Management Thread Worker (the_db_name): Thread 112458: ProcessSingleMaster: Failed to process slaves for master database the_db_name (async non-fatal failure)
In the Studio, most of the partitions are marked as “Impacted”, what does it mean?
In your text, you said you upgraded to 7.3.6 but the title says 7.5.6. I think you mean 7.5.6. Please clarify. Sorry to hear about your trouble. I will ask one of the developers to take a look.
Hello pierre,
We looked at the toolbox_version issue and found where the problem is coming from. Currently working on a fix for it in Tools. Thanks for reporting the issue.
Regarding your second issue, it might be because not all nodes are upgraded. Please check to make sure all nodes are on the new version.
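As a rough illustration of that check, the Python sketch below groups nodes by the version they report. How the versions are collected is left out; the input dictionary and the node names are hypothetical.

```python
from collections import defaultdict


def group_nodes_by_version(node_versions):
    """Map each version string to the list of nodes reporting it."""
    groups = defaultdict(list)
    for node, version in node_versions.items():
        groups[version].append(node)
    return dict(groups)


def is_fully_upgraded(node_versions, target):
    """True only if every node reports the target version."""
    return all(version == target for version in node_versions.values())


# Hypothetical cluster state: one leaf still on the old version.
nodes = {"master": "7.5.6", "leaf-1": "7.5.6", "leaf-2": "7.3.5"}
mixed = group_nodes_by_version(nodes)
```

More than one key in the grouped result means the cluster is still running mixed versions.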
Also, we are curious to know why detach failed in your upgrade in the first place. It would be great if you can send us a cluster report at bug-report@memsql.com
How can I finish the upgrade, please? The nodes now appear with the latest version after I rebooted them, but clearly the partitions on the second leaf are messed up… CPU is still at 100% and the whole cluster is more or less frozen.
I created a fresh 7.5.6 cluster and started to restore my backups.
On the leaf the tracelogs/memsql.log is also streaming millions of errors like this:
2015587609 2021-07-29 22:48:34.954 ERROR: Replication Management Thread Worker (my_db_name_0): Thread 115063: ProcessSingleMaster: Failed to process slaves for master database my_db_name (async non-fatal failure)
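To get an idea of which databases are affected, a small Python sketch like the one below can count these errors per replication worker. The line layout is inferred only from the sample above (numeric prefix, date, time, level, message), so treat the regular expressions as assumptions; the function name is hypothetical.

```python
import re
from collections import Counter

# Layout inferred from the sample line: "<number> <date> <time> <LEVEL>: <message>".
LINE_RE = re.compile(
    r"^\d+\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\.\d+\s+"
    r"(?P<level>[A-Z]+):\s+(?P<message>.*)$"
)
# The worker name in parentheses, e.g. "my_db_name_0".
WORKER_RE = re.compile(r"Replication Management Thread Worker \((?P<db>[^)]+)\)")


def count_replication_errors(lines):
    """Count ERROR lines per replication worker name."""
    counts = Counter()
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed or parsed.group("level") != "ERROR":
            continue  # skip other levels and lines that do not fit the layout
        worker = WORKER_RE.search(parsed.group("message"))
        if worker:
            counts[worker.group("db")] += 1
    return counts
```

Passing an open file handle for the leaf's memsql.log as `lines` streams the file without loading it all into memory, which helps when the log holds millions of entries.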
Feels like it’s a bug in the new release…
CLUSTER REPORT:
The cluster report is frozen on both clusters (upgraded one & fresh one), can’t get it.
The fresh cluster was set up on Debian 10, on 3 Google Cloud VMs with 8 cores (1 MA + 2 leaves with HA) and all the recommended OS optimisations (THP disabled etc…).
Even though I’m not yet paying for the license, this issue seems too serious to be left to community support… imagine the impact of such a topic on new customers reading the forum.
EDIT: the cluster report completed after a few hours, and I sent it to bug-report@memsql.com
Thanks @mpskovvang for sharing your experience with this version.
The SingleStore team is actually investigating my cluster report, and the bug seems to come from an SSL layer update in 7.5 that causes trouble between leaves… That would also explain why a fresh cluster had a similar issue.
Yes, there is an SSL performance regression in 7.5.6. It's small enough not to be noticeable if SSL is configured only between aggregators and the application (the typical configuration), but if SSL is enabled intra-cluster between the leaves, the slowdown is pretty drastic. Disabling intra-cluster SSL resolved things for Pierre. We have a fix in testing now and it will be released in the first 7.5 patch in a week or so.