I have self-hosted two aggregators and two leaves running in replication mode. I get this error a lot and constantly have to restart memsql on one of the leaves:

ERROR 1735 ER_CANNOT_CONNECT_TO_LEAF: Leaf Error (everclear4.XXXXXXXXXX:3306): Cannot connect to node @everclear4.XXXXXXXX:3306 with user distributed using password: YES [1045] Access denied for user 'distributed'@'IPADDRESS' (using password: YES)

The only remedy I have found is to restart memsql.service on that leaf.
I backed up the table structure, dropped all the tables and this leaf, and then re-created the leaf with sdb-admin add-leaf, but I still have the same symptoms.
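Since the underlying failure is an access-denied error for the distributed user, I have also been checking on the affected leaf whether that user still exists there with the expected grants. This is only a rough sketch of what I run; the host pattern is a guess on my part and may differ in your cluster:

```sql
-- Run on the affected leaf (everclear4) to confirm the internal
-- 'distributed' user is still defined there.
-- NOTE: the '%' host pattern below is an assumption; adjust it to
-- whatever host pattern your cluster actually uses for this user.
SHOW GRANTS FOR 'distributed'@'%';
```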
I asked SQrL about this and this was the response:
There could be several potential causes for an ER_CANNOT_CONNECT_TO_LEAF error. The error indicates that an aggregator node is unable to connect to a leaf node. One likely scenario is that when leaves connect to each other, they cache connection information (e.g. leaf-1 is at IP x.x.x.x, leaf-2 is at IP y.y.y.y, etc.). If the IPs of these leaves ever change, the cache will not automatically update. This will ultimately result in an unsuccessful connection attempt because the other leaves in the cluster are using outdated IP address information.
To resolve this issue, you can flush the DNS cache and connection pools on all affected nodes with the following SQL commands:

```sql
FLUSH HOSTS; FLUSH CONNECTION POOLS;
```

FLUSH HOSTS clears the DNS cache on the node, and FLUSH CONNECTION POOLS shuts down all existing connections and closes idle pooled connections. Run these commands on all the affected nodes. Ensure that there is network connectivity between the aggregator and leaf nodes and that no firewalls or security group settings are blocking the connections.
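For example, the master aggregator's current view of each leaf (host, port, state, opened connections, roundtrip latency) can be compared before and after flushing; a minimal check, with column names possibly varying slightly between versions, would be:

```sql
-- Run on the master aggregator before and after FLUSH HOSTS /
-- FLUSH CONNECTION POOLS to see whether the state of the affected
-- leaf (everclear4) changes.
SHOW LEAVES;

-- The cluster's view of the aggregators can be checked the same way.
SHOW AGGREGATORS;
```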
If the problem persists, it may be due to hitting a capacity limit. As CPU maxes out, queries will start taking longer to complete as the kernel queues them. You need to tune max_connection_threads to find the correct balance between the number of concurrently running queries and resource utilization. As you increase the max threads per aggregator, you should monitor resource usage (CPU, load average, memory) during a typical workload. These are some general remedies; however, specific problems may require specific solutions, so you might need to evaluate your system's logs and conditions more thoroughly.
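As a rough illustration of that tuning advice (a minimal sketch only; how the variable is actually changed depends on your deployment and version):

```sql
-- Check the current limit on each aggregator.
SHOW VARIABLES LIKE 'max_connection_threads';

-- Get a feel for how many queries are running concurrently while the
-- errors occur; compare this against CPU and load average on the host.
SHOW PROCESSLIST;
```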
This error might also occur in the SingleStoreDB managed service when leaves cannot connect to S3, for instance due to a 'dial tcp i/o timeout' error. This might require checking firewall and network routing configurations.
If you constantly need to restart a specific leaf in your cluster and the problem keeps recurring, there may be a larger issue at hand, and you might want to discuss this with the SingleStoreDB support team.
Make sure to follow these steps and perform the analysis in a systematic manner.
That is an interesting suggestion; no IP addresses have changed at all, but the cluster does rely on DNS records to resolve the node names to IPs. My question for the forum is: how can I inspect the records in the DNS cache before flushing them, so I can see what is going on and troubleshoot this problem further?
Does anyone have any other suggestions about why this could be happening? This error barely shows up on the forum at all, yet in my cluster it is happening very frequently, not just once a day or so. Thanks.