I have a memsql 5.5 instance with a master and 2 leaf nodes. Every day at 10am, leaf node 2 stops responding to heartbeat pings and gets evicted by the master. It comes back up in about 2 minutes. Around 10 minutes later, leaf node 1 gets evicted and also comes back up in a couple of minutes. So we have an outage every single day from 10am to 10:20am.
I do not see any long-running queries in MemSQL Ops right before the nodes go down.
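In case it is useful, I can also check for active statements directly on the nodes (outside of MemSQL Ops) with a plain processlist query; a minimal sketch of what that would look like:

-- Run on the master aggregator (and repeated on each leaf on port 3306)
-- shortly before 10am; it lists the statements currently executing on that node.
SHOW PROCESSLIST;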
The only thing I see is that the leaf nodes show double the memory utilization compared to the master node.
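The memory numbers above are what MemSQL Ops reports; I can also pull them per node with something like the following (a sketch, assuming the maximum_memory variable and the Total_server_memory / Alloc_* status counters go by these names in 5.5.8):

-- Run against each node (master, leaf01, leaf02) on port 3306.

-- Configured memory ceiling for the node
SHOW VARIABLES LIKE 'maximum_memory';

-- Overall memory currently in use by the node (assumed counter name)
SHOW STATUS EXTENDED LIKE 'Total_server_memory';

-- Per-allocator breakdown, to see what is actually growing on the leaves
SHOW STATUS EXTENDED LIKE 'Alloc_%';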
I also noticed that MemSQL Ops suggests that most of the tables in all databases in the cluster need to be analyzed, which I did. But the message in MemSQL Ops doesn't disappear; it still recommends analyzing them.
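For reference, the analyze I ran was just the plain statement, repeated for every table MemSQL Ops flagged (db_name and table_name below are placeholders, not my real schema):

-- Repeated for each flagged table; db_name/table_name are placeholders
ANALYZE TABLE db_name.table_name;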
OS - CentOS release 6.10 (Final)
memsql version - 5.5.8 (Community Edition)
Cluster - Master + 2 leaf nodes.
memsql> SHOW PARTITIONS;
+---------+-----------------------------+------+--------+--------+
| Ordinal | Host                        | Port | Role   | Locked |
+---------+-----------------------------+------+--------+--------+
|       0 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       1 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
|       2 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       3 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
|       4 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       5 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
|       6 | memsql50leaf02.test.tsi.lan | 3306 | Master |      0 |
|       7 | memsql50leaf01.test.tsi.lan | 3306 | Master |      0 |
+---------+-----------------------------+------+--------+--------+
8 rows in set (0.00 sec)
memsql>
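If the leaf status as the master sees it would help, I can also post the output of these (SHOW LEAVES and SHOW AGGREGATORS are the commands I know of; let me know if there is something better in 5.5):

-- Run on the master aggregator
SHOW LEAVES;       -- leaf state and pairing
SHOW AGGREGATORS;  -- aggregator list, for completeness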
memsql.cnf:
; ------------------------------------------------------------------------
; THIS CONFIGURATION FILE IS MANAGED BY MEMSQL OPS
; MemSQL Ops controls the data in this file. Please be careful
; when editing it.
; For more information, see our documentation at http://docs.memsql.com
; ------------------------------------------------------------------------
[server]
basedir = .
bind_address = 0.0.0.0
core_file
default_partitions_per_leaf = 4
durability = on
lc_messages_dir = ./share
lock_wait_timeout = 60
max_connections = 100000
snapshot_trigger_size = 1024m
snapshots_to_keep = 2
socket = memsql.sock
ssl_cert = /var/lib/memsql/certs/server-cert.pem
ssl_key = /var/lib/memsql/certs/server-key.pem
tmpdir = .
transaction_buffer = 64m
; ------------------------------------------------------------------------
; MEMSQL OPS VARIABLES
;
; Variables below this header are controlled by MemSQL Ops.
; Please do not edit any of these values directly.
; ------------------------------------------------------------------------
master_aggregator
port = 3306
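Those are the values from the file on disk; I can cross-check the running values on each node with SHOW VARIABLES if that matters, e.g. for the durability and snapshot settings above:

-- Run on each node to confirm the effective values match memsql.cnf
SHOW VARIABLES LIKE 'durability';
SHOW VARIABLES LIKE 'snapshot%';
SHOW VARIABLES LIKE 'maximum_memory';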
I am not able to upload the full log files from all 3 nodes for that 15-minute window, but this is what I see in them:
Master:
57012102350 2019-10-04 10:00:47 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 2004:Cannot connect to 'memsql50leaf02.test.tsi.lan':3306. Errno=111 (Connection refused))
57012631138 2019-10-04 10:00:47 INFO: ReplicationMaster[sharding]: Database 'sharding': disconnecting from sync slave 172.16.10.218:3306/sharding
57013102437 2019-10-04 10:00:48 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 2004:Cannot connect to 'memsql50leaf02.test.tsi.lan':3306. Errno=111 (Connection refused))
57014102502 2019-10-04 10:00:49 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 2004:Cannot connect to 'memsql50leaf02.test.tsi.lan':3306. Errno=111 (Connection refused))
57015102435 2019-10-04 10:00:50 ERROR: Leaf memsql50leaf02.test.tsi.lan:3306 has failed. Trying to remove it from the cluster.
57015102501 2019-10-04 10:00:50 WARN: Leaf memsql50leaf02.test.tsi.lan:3306 has not responded to the heartbeat (error 0:failed to establish connection)
57015188002 2019-10-04 10:00:50 INFO: Flushed 0 connections to 'memsql50leaf02.test.tsi.lan':3306
.
.
.
Leaf 01 and 02:
57094223350 2019-10-04 10:00:50 INFO: Replaying logs/sharding_log_6: Flushed 0 connections to 'memsql50leaf02.test.tsi.lan':3306
.
.
DropInstance()
.
.
Replaying new instance
.
.
Shutting down MemSQL
Unloading all databases / Pausing sharding database
.
.
memsql started
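Once a leaf comes back and finishes replaying, this is roughly what I check on the master to confirm everything reattached (assuming SHOW DATABASES EXTENDED behaves the same way in 5.5.8):

-- Per-database state; looking for anything stuck in a recovering state
SHOW DATABASES EXTENDED;

-- Partition map, to confirm all 8 partitions are back on the two leaves
SHOW PARTITIONS;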
What else can I check? Let me know if there is any more relevant information I can provide here.
We have builds failing every day, and any help would be highly appreciated. Thank you!