I have a memsql 5.5 instance with a master and 2 leaf nodes. Every day at 10am, leaf node 2 stops responding to heartbeat pings and gets evicted by the master. It comes back up in about 2 minutes. Around 10 minutes later, leaf node 1 gets evicted and also comes back up in a couple of minutes. So we have an outage every single day from 10am to 10:20am.
I do not see any long-running queries in MemSQL Ops right before the nodes go down.
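In case it is useful, I can also check for active statements directly on the nodes (outside of MemSQL Ops) with a plain processlist query; a minimal sketch of what that would look like:

-- Run on the master aggregator (and repeated on each leaf on port 3306)
-- shortly before 10am; it lists the statements currently executing on that node.
SHOW PROCESSLIST;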
The only thing I see is that the leaf nodes show double the memory utilization compared to the master node.
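The memory numbers above are what MemSQL Ops reports; I can also pull them per node with something like the following (a sketch, assuming the maximum_memory variable and the Total_server_memory / Alloc_* status counters go by these names in 5.5.8):

-- Run against each node (master, leaf01, leaf02) on port 3306.

-- Configured memory ceiling for the node
SHOW VARIABLES LIKE 'maximum_memory';

-- Overall memory currently in use by the node (assumed counter name)
SHOW STATUS EXTENDED LIKE 'Total_server_memory';

-- Per-allocator breakdown, to see what is actually growing on the leaves
SHOW STATUS EXTENDED LIKE 'Alloc_%';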
I also noticed that MemSQL Ops suggests that most of the tables in all databases in the cluster need to be analyzed, which I did. But the message in MemSQL Ops doesn't disappear; it still recommends analyzing them.
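For reference, the analyze I ran was just the plain statement, repeated for every table MemSQL Ops flagged (db_name and table_name below are placeholders, not my real schema):

-- Repeated for each flagged table; db_name/table_name are placeholders
ANALYZE TABLE db_name.table_name;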
OS - CentOS release 6.10 (Final)
memsql version - 5.5.8 (Community Edition)
Cluster - Master + 2 leaf nodes.
memsql> SHOW PARTITIONS;
+---------+-----------------------------+------+--------+--------+
| Ordinal | Host                        | Port | Role   | Locked |
+---------+-----------------------------+------+--------+--------+
|       0 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       1 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
|       2 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       3 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
|       4 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       5 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
|       6 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       7 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
+---------+-----------------------------+------+--------+--------+
8 rows in set (0.00 sec)
memsql>
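If the leaf status as the master sees it would help, I can also post the output of these (SHOW LEAVES and SHOW AGGREGATORS are the commands I know of; let me know if there is something better in 5.5):

-- Run on the master aggregator
SHOW LEAVES;       -- leaf state and pairing
SHOW AGGREGATORS;  -- aggregator list, for completeness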
memsql.cnf:
; ------------------------------------------------------------------------
; THIS CONFIGURATION FILE IS MANAGED BY MEMSQL OPS
; MemSQL Ops controls the data in this file. Please be careful
; when editing it.
; For more information, see our documentation at http://docs.memsql.com
; ------------------------------------------------------------------------
[server]
basedir = .
bind_address = 0.0.0.0
core_file
default_partitions_per_leaf = 4
durability = on
lc_messages_dir = ./share
lock_wait_timeout = 60
max_connections = 100000
snapshot_trigger_size = 1024m
snapshots_to_keep = 2
socket = memsql.sock
ssl_cert = /var/lib/memsql/certs/server-cert.pem
ssl_key = /var/lib/memsql/certs/server-key.pem
tmpdir = .
transaction_buffer = 64m
; ------------------------------------------------------------------------
; MEMSQL OPS VARIABLES
;
; Variables below this header are controlled by MemSQL Ops.
; Please do not edit any of these values directly.
; ------------------------------------------------------------------------
master_aggregator
port = 3306
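Those are the values from the file on disk; I can cross-check the running values on each node with SHOW VARIABLES if that matters, e.g. for the durability and snapshot settings above:

-- Run on each node to confirm the effective values match memsql.cnf
SHOW VARIABLES LIKE 'durability';
SHOW VARIABLES LIKE 'snapshot%';
SHOW VARIABLES LIKE 'maximum_memory';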
I am not able to upload the full log files from all 3 nodes for that 15-minute window, but this is what I see in them:
Master:
57012102350 2019-10-04 10:00:47 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 2004:Cannot connect to 'memsql50leaf02.test.tsi.lan':3306. Errno=111 (Connection refused))
57012631138 2019-10-04 10:00:47 INFO: ReplicationMaster[sharding]: Database 'sharding': disconnecting from sync slave 172.16.10.218:3306/sharding
57013102437 2019-10-04 10:00:48 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 2004:Cannot connect to 'memsql50leaf02.test.tsi.lan':3306. Errno=111 (Connection refused))
57014102502 2019-10-04 10:00:49 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 2004:Cannot connect to 'memsql50leaf02.test.tsi.lan':3306. Errno=111 (Connection refused))
57015102435 2019-10-04 10:00:50 ERROR: Leaf memsql50leaf02.test.tsi.lan:3306 has failed. Trying to remove it from the cluster.
57015102501 2019-10-04 10:00:50 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 0:failed to establish connection)
57015188002 2019-10-04 10:00:50 INFO: Flushed 0 connections to 'memsql50leaf02.test.tsi.lan':3306
.
.
.
Leaf 01 and 02:
57094223350 2019-10-04 10:00:50 INFO: Replaying logs/sharding_log_6: Flushed 0 connections to 'memsql50leaf02.test.tsi.lan':3306
.
.
DropInstance()
.
.
Replaying new instance
.
.
Shutting down MemSQL
Unloading all databases / Pausing sharding database
.
.
memsql started
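Once a leaf comes back and finishes replaying, this is roughly what I check on the master to confirm everything reattached (assuming SHOW DATABASES EXTENDED behaves the same way in 5.5.8):

-- Per-database state; looking for anything stuck in a recovering state
SHOW DATABASES EXTENDED;

-- Partition map, to confirm all 8 partitions are back on the two leaves
SHOW PARTITIONS;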
What else can I check? Let me know if there is any more relevant information I can provide here.
We have builds failing every day, and any help would be highly appreciated. Thank you!