Failed to allocate memory, "miss-configred operating system or VM"

parente · July 9, 2019, 12:56am

We provisioned a MemSQL 6.7.14 cluster using the MemSQL-rpvoded AWS CloudFormation template with eight r4.2xlarge leaf instances (configured for HA) and two m4.large instances acting as aggregators. Without a significant increase in data in the cluster or configuration change, we recently started seeing leaf nodes go offline sporadically (every few hours) after a spurt of messages in their tracelogs like the following:

11150142746 2019-07-08 00:24:09.581   WARN: Failed to allocate 8388608 bytes of memory from the
 operating system (Error 12: Cannot allocate memory). This is usually due to a miss-configured operating
 system or virtualization technology. See https://docs.memsql.com/troubleshooting/latest/memory-errors.

The messages are sometimes followed with log lines like the following (some kind of crash information?):

query: _REPL 0 3002 -1 16 1 0 0
query: _REPL 0 4000 -1 16 1 0 0

Then the memsqld process appears to restart:

34 2019-07-08 23:22:39.922 INFO: Log opened
01737143 2019-07-08 23:22:41.659   INFO: Initializing OpenSSL
01738123 2019-07-08 23:22:41.660   INFO: MemSQL version hash: fa416b0a536adcfcf95d0607be2d6086a0d58796 (Mon Mar 4 15:00:38 2019 -0500)
...

The memory allocation error usually starts to appear amidst other log messages about replication, but not always.

According to our DataDog monitoring, the usable RAM on the host remains high (~24 GiB, 7 GiB immediately available, plus 17 GiB more after Linux cache flush) at the time of the error. Likewise, querying the information_schema.mv_nodes table consistently shows lots of headroom between the memory allocated and max memory available for all nodes, e.g. for one affected node:

MAX_MEMORY_MB        = 55261
MEMORY_USED_MB       = 30751
MAX_TABLE_MEMORY_MB  = 49734
TABLE_MEMORY_USED_MB = 24540

The error message makes it sound like the CloudFormation template may have missed some configuration at the OS level at the time of install. I’ve spot-checked the sysctl, hugepage, etc. settings mentioned in SingleStoreDB Cloud · SingleStore Documentation on a few of the impacted hosts and didn’t see anything mismatched.

Has anyone else experienced a similar problem using MemSQL on AWS? Is it possible the OS is misconfigured like the memory warning message says or is that statement a catch-all and red herring?

adam · July 9, 2019, 2:44am

That error means Linux refused a memory allocation request (and MemSQLs memory use was under the maximum_memory variable at the time. You’ll get a different error if MemSQL is refusing to allocate memory because its memory use has reached maximum_memory).

The first thing to check is run memsql-report check --all as a sanity check on the configuration.
(SingleStoreDB Cloud · SingleStore Documentation)

If things look okay in that output then generate a cluster report (memsql-report collect) and send it to bug-report@memsql.com
(SingleStoreDB Cloud · SingleStore Documentation)

parente · July 9, 2019, 12:11pm

Thanks @adam. I collected a report on one of the affected nodes right after a crash like the one I described with:

memsql-report collect-local --all

I then ran a check on the report. (The memsql-report tool tells me I have to specify a path to report file when running check. ¯\_(ツ)_/¯)

memsql-report check --all --report-path ./report-2019-07-09T120753.tar.gz

Every line in the check output indicates a PASS. I’ll file a ticket with support as you suggest.

(Ticket link: https://support.memsql.com/hc/en-us/requests/8622)

parente · July 9, 2019, 12:20pm

I should also mention …

I noticed there are core dump files on each of the impacted leaf nodes with timestamps coinciding with the times they went offline. I’ll make sure I mention this fact in the ticket, but wanted to also include that information here in public for posterity.

adam · July 18, 2019, 4:29am

Following up here. Our cloud formation template doesn’t set up any swap. MemSQL won’t use swap in its default configuration, but the lack of swap makes linux more aggressive about failing memory allocations (and triggering the linux OOM killer : SingleStoreDB Cloud · SingleStore Documentation) if memory use gets high. MemSQL in its default config limits itself to 90% of the machines physical memory - which is an aggressive enough configuration that swap is required. We are looking into addressing this issue in our cloud formation template.