Hello,
My customer says that one of the nodes in the cluster goes offline at random, roughly twice a week. He is right: I can see the event logged in the Events section of MemSQL Studio.
What can I check to understand why this happens?
Thank you.
Hi,
One way is to check the master aggregator's tracelog (the tracelogs/memsql.log file). It has more detailed information about the failover. Look for traces similar to:
144740819 2019-10-30 15:28:33.883 INFO: ProcessTransactions Node 10.0.3.171:3306 heartbeat failure summary. Initial heartbeat failure at 2019-10-30 15:28:32. 210 Consecutively Missed heartbeats. Failover was triggered after 200 missed heartbeats
144740830 2019-10-30 15:28:33.883 INFO: ProcessTransactions Heartbeat connection attempts summary:
144740837 2019-10-30 15:28:33.883 INFO: ProcessTransactions Node 10.0.3.171:3306 encountered a heartbeat connection failure (111:Connection refused) at 2019-10-30 15:28:32
144740844 2019-10-30 15:28:33.883 INFO: ProcessTransactions Node 10.0.3.171:3306 encountered a heartbeat connection failure (111:Connection refused) at 2019-10-30 15:28:33
144740851 2019-10-30 15:28:33.884 INFO: ProcessTransactions Node 10.0.3.171:3306 encountered a heartbeat connection failure (111:Connection refused) at 2019-10-30 15:28:33
144740858 2019-10-30 15:28:33.884 INFO: ProcessTransactions Node 10.0.3.171:3306 heartbeat is currently attempting to reconnect
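If the tracelog is large, a small script can pull out just these lines. Below is a rough Python sketch, not an official tool: the regular expression is modeled only on the sample lines above, so the wording may need adjusting for other versions, and the function name find_heartbeat_failures is made up for this example.

```python
import re

# Pattern modeled on the sample trace lines above (an assumption:
# other MemSQL versions may word these messages differently).
FAILURE_RE = re.compile(
    r"Node (?P<node>\S+) encountered a heartbeat connection failure "
    r"\((?P<errno>\d+):(?P<reason>[^)]*)\) "
    r"at (?P<at>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)

def find_heartbeat_failures(lines):
    """Return (node, errno, reason, failed_at) for each matching trace line."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            failures.append((
                match.group("node"),
                int(match.group("errno")),
                match.group("reason"),
                match.group("at"),
            ))
    return failures
```

Pass it any iterable of lines, for example an open file handle on the master aggregator's tracelog. The error code and reason (here 111, connection refused) tell you whether the node was unreachable or whether nothing was listening on the port at that moment.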
Then check the tracelog of the leaf that failed to see what it was doing at the time of the failover. Did it encounter a problem (was the host healthy, did it run out of disk, did the memsqld process crash for some reason, etc.)?
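To narrow the leaf's tracelog down to the moment of the failover, something like the following Python sketch may help. It assumes the leaf's tracelog lines start with the same "<number> <date> <time>" prefix as the master aggregator lines shown above, and the function name lines_near and the 60-second default window are arbitrary choices for this example.

```python
from datetime import datetime, timedelta

def lines_near(lines, failure_time, window_seconds=60):
    """Return tracelog lines logged within window_seconds of failure_time.

    failure_time is a string such as "2019-10-30 15:28:32", i.e. the
    initial heartbeat failure time reported by the master aggregator.
    """
    center = datetime.strptime(failure_time, "%Y-%m-%d %H:%M:%S")
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 3:
            continue  # too short to carry a timestamp
        try:
            # Assumed layout: parts[1] is the date, parts[2] the time with milliseconds.
            stamp = datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            continue  # skip lines without a parsable timestamp (e.g. stack traces)
        if abs(stamp - center) <= window:
            selected.append(line.rstrip("\n"))
    return selected
```

Run it over the failed leaf's tracelog with the failure time from the heartbeat summary, then read the surviving lines for errors, out-of-disk messages, or signs of a restart. Note that continuation lines without a timestamp are dropped, so widen the window or open the file directly if the output looks incomplete.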
-Adam