We provisioned a MemSQL 6.7.14 cluster using the MemSQL-provided AWS CloudFormation template, with eight r4.2xlarge leaf instances (configured for HA) and two m4.large instances acting as aggregators. With no significant growth in data volume and no configuration changes, we recently started seeing leaf nodes go offline sporadically (every few hours) after a burst of messages like the following in their tracelogs:
11150142746 2019-07-08 00:24:09.581 WARN: Failed to allocate 8388608 bytes of memory from the
operating system (Error 12: Cannot allocate memory). This is usually due to a miss-configured operating
system or virtualization technology. See https://docs.memsql.com/troubleshooting/latest/memory-errors.
The messages are sometimes followed by log lines like the following (some kind of crash information?):
query: _REPL 0 3002 -1 16 1 0 0
query: _REPL 0 4000 -1 16 1 0 0
Then the memsqld process appears to restart:
34 2019-07-08 23:22:39.922 INFO: Log opened
01737143 2019-07-08 23:22:41.659 INFO: Initializing OpenSSL
01738123 2019-07-08 23:22:41.660 INFO: MemSQL version hash: fa416b0a536adcfcf95d0607be2d6086a0d58796 (Mon Mar 4 15:00:38 2019 -0500)
...
The memory allocation error usually, though not always, starts to appear amid other log messages about replication.
According to our DataDog monitoring, the host still has plenty of usable RAM at the time of the error: roughly 24 GiB, of which about 7 GiB is immediately free and another ~17 GiB is reclaimable from the Linux page cache. Likewise, querying the information_schema.mv_nodes table consistently shows ample headroom between memory allocated and maximum memory available on every node, e.g. for one affected node:
MAX_MEMORY_MB = 55261
MEMORY_USED_MB = 30751
MAX_TABLE_MEMORY_MB = 49734
TABLE_MEMORY_USED_MB = 24540
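For reference, those numbers come from a query along these lines, run through the mysql client against the master aggregator (the host placeholder and connection flags are ours):

mysql -h <master-aggregator> -P 3306 -u root -p -e "
  SELECT MAX_MEMORY_MB, MEMORY_USED_MB,
         MAX_TABLE_MEMORY_MB, TABLE_MEMORY_USED_MB
  FROM information_schema.mv_nodes;"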
The error message makes it sound like the CloudFormation template may have missed some OS-level configuration at install time. I've spot-checked the sysctl, hugepage, etc. settings mentioned in the SingleStore documentation on a few of the impacted hosts and didn't see anything mismatched; the exact checks I ran are below.
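Concretely, these are the read-only spot-checks I ran on the impacted hosts (the keys below are the ones I recall from the linked docs and may not be exhaustive; I compared each value against the documented recommendation):

sysctl vm.max_map_count vm.min_free_kbytes vm.swappiness
cat /sys/kernel/mm/transparent_hugepage/enabled  # docs say THP should be disabled
cat /sys/kernel/mm/transparent_hugepage/defrag
free -m  # cross-check against the DataDog numbers above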
Has anyone else experienced a similar problem running MemSQL on AWS? Is it possible the OS is misconfigured, as the memory warning suggests, or is that statement a catch-all and a red herring?