Dear MemSQL team,
We have a MemSQL 6.5.19 cluster with 6 aggregators and 20 leaves, each with 64 GB of RAM. We added 5 new leaves because we were running out of memory. After adding the new leaves to the cluster through the memsql-ops UI, I manually copied some partitions from the existing 20 leaves to the new ones. Immediately after each COPY PARTITION, I promoted the new partition on the new leaf and dropped the old one. Each of the original leaves normally has about 250 partitions, while the new ones sometimes have as few as 50. Even so, the new leaves (any of them) become unresponsive. memsql-ops reports them as being in an unknown state, and I cannot even ping the affected leaf. The only thing I can do is hard-reset the whole server. This is what is in the logs:
12618508212 2020-04-01 12:38:10.704  WARN: socket (908) ETIMEDOUT in poll
12621514187 2020-04-01 12:38:13.710  WARN: socket (2973) ETIMEDOUT in poll
12621514195 2020-04-01 12:38:13.710  WARN: socket (1789) ETIMEDOUT in poll
12624520143 2020-04-01 12:38:16.716  WARN: socket (1803) ETIMEDOUT in poll
12624520213 2020-04-01 12:38:16.716  WARN: socket (2477) ETIMEDOUT in poll
12625025963 2020-04-01 12:38:17.222  WARN: socket (1342) ETIMEDOUT in poll
12627526315 2020-04-01 12:38:19.722  WARN: socket (4140) ETIMEDOUT in send
12627526343 2020-04-01 12:38:19.722  WARN: socket (1401) ETIMEDOUT in recv
12627526509 2020-04-01 12:38:19.723  WARN: socket (1849) ETIMEDOUT in recv
12627526519 2020-04-01 12:38:19.723  WARN: socket (1383) ETIMEDOUT in recv
12627526528 2020-04-01 12:38:19.723  WARN: socket (2500) ETIMEDOUT in recv
12627526538 2020-04-01 12:38:19.723  WARN: socket (2053) ETIMEDOUT in recv
12627526547 2020-04-01 12:38:19.723  WARN: socket (1878) ETIMEDOUT in recv
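In case it helps anyone reading along, here is a minimal Python sketch for tallying these warnings by the call that timed out. It assumes every line follows the layout in the excerpt above; the function name `count_timeouts` and the regular expression are my own and are not part of any MemSQL tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# <counter> <date> <time>  WARN: socket (<fd>) ETIMEDOUT in <call>
LINE_RE = re.compile(r"WARN: socket \((\d+)\) ETIMEDOUT in (\w+)")


def count_timeouts(lines):
    """Count ETIMEDOUT warnings per call (poll, send, recv)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the excerpt; in practice, read the full log file.
    sample = [
        "12618508212 2020-04-01 12:38:10.704  WARN: socket (908) ETIMEDOUT in poll",
        "12627526315 2020-04-01 12:38:19.722  WARN: socket (4140) ETIMEDOUT in send",
        "12627526343 2020-04-01 12:38:19.722  WARN: socket (1401) ETIMEDOUT in recv",
    ]
    for call, n in sorted(count_timeouts(sample).items()):
        print(call, n)
```

Run over the 13 lines pasted above, the tally is 6 timeouts in poll, 1 in send, and 6 in recv, all within about nine seconds.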
Sometimes it happens shortly after copying, promoting, and dropping the partitions, and sometimes hours later. After I reset the server and attach the leaf again, everything runs fine for a few hours, and then it happens again and again until I copy the partitions back from the new leaves.
The new leaves have exactly the same configuration as the other leaves. I tried changing timeouts and sysctl variables, but nothing really helps. MemSQL just floods all the connection sockets, making the whole server unreachable.
What can we try, and why is it happening only on the new servers? Any help would be really appreciated. Thank you!