Leaf Nodes Stuck in RECOVERY_FAILED State

Reposting this from the public Slack chat channel; it was originally posted by Manick Mehra, who is new to MemSQL.

Having trouble with one of the leaf nodes: it keeps going into the RECOVERY_FAILED state. The relevant logs are below.

2019-02-11 16:17:58.065 ERROR: Replaying logs/cluster_log_2176: mmap(0, 255496192, PROT_READ, MAP_PRIVATE, 19, 0) failed: 12(Cannot allocate memory)
4819025784270 2019-02-11 16:17:58.065 ERROR: Replaying logs/cluster_log_2176: Thread 105100: HandleAbortAndErrorCodes: Failed to memory-map the file 'logs/cluster_log_2176'.  Replay failed.
4819025784283 2019-02-11 16:17:58.065 ERROR: Replaying logs/cluster_log_2176: File replay for 'logs/cluster_log_2176' failed near offset 255483985!
4819025784294 2019-02-11 16:17:58.065  INFO: Replaying logs/cluster_log_2176: Finishing at offset 255483985 (error)
4819025784307 2019-02-11 16:17:58.065  INFO: Thread 105100: PreTransitionToOffline: `cluster` columnar log: Transition started at term 0.
4819025784326 2019-02-11 16:17:58.066  INFO: Thread 105100: PreTransitionToOffline: `cluster` columnar log: Transition finished with result `Success`.
4819025784336 2019-02-11 16:17:58.066  INFO: Thread 105100: TransitionToOffline: `cluster` columnar log: Transition started at term 0.
4819025813985 2019-02-11 16:17:58.095  INFO: Thread 105100: TransitionToOffline: `cluster` columnar log: Transition finished with result `Success`.
4819025814027 2019-02-11 16:17:58.095  INFO: Thread 105100: PreTransitionToOffline: `cluster` rowstore log: Transition started at term 11.
4819025814043 2019-02-11 16:17:58.095  INFO: Thread 105100: PreTransitionToOffline: `cluster` rowstore log: Transition finished with result `Success`.
4819025814058 2019-02-11 16:17:58.095  INFO: Thread 105100: TransitionToOffline: `cluster` rowstore log: Transition started at term 11.
4819025825152 2019-02-11 16:17:58.106  INFO: Thread 105101: operator(): `cluster` rowstore log: Disconnecting slave at node 38 because 'ReplLog' is resetting.
4819025825185 2019-02-11 16:17:58.106  INFO: Thread 105101: operator(): cluster: Removing slave id 2, node id 38 at barrier number 15.
4819025825198 2019-02-11 16:17:58.106  INFO: Thread 105101: RemoveSlave: `cluster` rowstore log: Removal mentioned above took effect at LSN 0x8800000f3aa.
4819025825207 2019-02-11 16:17:58.106  INFO: Thread 105101: operator(): `cluster` rowstore log: Disconnecting slave at node 37 because 'ReplLog' is resetting.
4819025825221 2019-02-11 16:17:58.106  INFO: Thread 105101: operator(): cluster: Removing slave id 3, node id 37 at barrier number 16.
4819025825230 2019-02-11 16:17:58.106  INFO: Thread 105101: RemoveSlave: `cluster` rowstore log: Removal mentioned above took effect at LSN 0x8800000f3aa.
4819025825239 2019-02-11 16:17:58.106  INFO: Thread 105101: CloseLogFile: `cluster` rowstore log: Flushing log file 'logs/cluster_log_2176' in order to close it.
4819025827029 2019-02-11 16:17:58.108  INFO: Thread 105100: TransitionToOffline: `cluster` rowstore log: Transition finished with result `Success`.
4819029957125 2019-02-11 16:18:02.238  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819034957340 2019-02-11 16:18:07.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819039957365 2019-02-11 16:18:12.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819044957488 2019-02-11 16:18:17.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819049957622 2019-02-11 16:18:22.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819054957780 2019-02-11 16:18:27.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819059957866 2019-02-11 16:18:32.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819064957991 2019-02-11 16:18:37.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819069958112 2019-02-11 16:18:42.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819074958230 2019-02-11 16:18:47.239  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819079958379 2019-02-11 16:18:52.240  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819084958496 2019-02-11 16:18:57.240  INFO: Thread 105112: TakeDatabaseSnapshot: Aborting snapshot for database `cluster`.
4819086949303 2019-02-11 16:18:59.230 ERROR: ProcessNotifications Thread 105035: ConnectToSlave: Could not find cluster database entry, likely because that database reprovisioned.
4819086949340 2019-02-11 16:18:59.231 ERROR: ProcessNotifications Thread 105035: ConnectToSlave: Timed out waiting for cluster db to reprovision.
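
For context on the ERROR lines above: errno 12 (ENOMEM) from mmap means the OS could not satisfy the ~255 MB mapping of 'logs/cluster_log_2176', so replay aborts and the node ends up in RECOVERY_FAILED. As a minimal, hedged starting point (assuming a standard MemSQL setup), the node state and its configured memory ceilings can be checked like this:

    -- On the master aggregator: list leaves and the state each one reports
    SHOW LEAVES;

    -- On the failing leaf: the engine's configured memory limits, in MB
    SHOW VARIABLES LIKE '%memory%';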

The following command runs indefinitely. Where do I need to look?

REMOVE LEAF "<DEAD_LEAF>"[:<PORT>];
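
A hedged aside on the hang: if the leaf is unreachable or cannot recover, REMOVE LEAF may wait indefinitely on data movement from that node. To the best of my knowledge the syntax accepts a FORCE modifier that deregisters the leaf without waiting, at the cost of losing any partitions whose only copy lived there; run it on the master aggregator:

    -- Forcibly deregister a dead leaf (placeholders as in the command above)
    REMOVE LEAF '<DEAD_LEAF>':<PORT> FORCE;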

Thanks, Jacky, for posting it here; hoping to get a response here at least.

Hello,

Can you provide the following details?

  1. How many MA, CA, and leaf nodes are in the cluster?

  2. Please share details such as the physical memory capacity of each host where the MA, CA, and leaf nodes are deployed.

  3. Are you using memsql-ops to deploy the cluster?

  4. Please share the output of the following from the node where you encounter the RECOVERY_FAILED error.

    show variables like '%memory%';

Hello,

  1. There are one master aggregator and 3 leaf nodes in the cluster.
  2. Leaf nodes: r4.4xlarge; master: m4.4xlarge.
  3. Yes, memsql-ops was used to deploy the cluster.

+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| maximum_memory       | 112631 |
| maximum_table_memory | 101367 |
+----------------------+--------+
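
For reference, both values are in megabytes: maximum_memory is the total memory ceiling the node will try to stay under, and maximum_table_memory is the portion of that available to in-memory (rowstore) table data. A hedged way to see how close the node actually is to that ceiling (exact counter names can vary by version) is to inspect the allocator counters:

    -- On the affected leaf: allocator counters (e.g. Total_server_memory)
    -- show current engine memory use to compare against maximum_memory
    SHOW STATUS EXTENDED;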

Hello,

Please share the physical memory capacity of the r4.4xlarge and m4.4xlarge instances.

Thanks
Jaimin

      Instance     vCPU   Memory (GiB)
      r4.4xlarge   16     122
      m4.4xlarge   16     64

@jshah
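
Rough arithmetic on those numbers (my own back-of-the-envelope, not engine output): 122 GiB is roughly 125,000 MB, and maximum_memory = 112631 is about 90% of that, which leaves on the order of 12 GB of headroom on the leaf for the OS, other processes, and memory-mapped files. If that headroom is exhausted at the moment the leaf tries to mmap the ~255 MB log during replay, the kernel returns ENOMEM and recovery fails, which matches the log lines at the top of this thread. A hedged mitigation sketch, assuming these variables can be changed at runtime in this MemSQL version (otherwise set them in memsql.cnf and restart the node); the values below are illustrative only:

    -- On the failing leaf: temporarily lower the memory ceilings to leave
    -- headroom for log replay, then restart the MemSQL node
    SET GLOBAL maximum_memory = 105000;       -- MB, example value
    SET GLOBAL maximum_table_memory = 94500;  -- MB, ~90% of the new ceiling

Freeing memory on the host, or moving the leaf to a larger instance, gives the same headroom without touching the ceilings.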