HA not working properly

moralez.rodrigo · June 26, 2023, 3:09pm

We have HA enabled, but when we stop one of the servers the database stops working.

When the service connects to the second server (the one with the aggregator), it can connect, but it triggers this error when the query runs:

java.sql.SQLException: Cannot connect to node @memsql-01:3307 with user distributed using password: YES [2004] Cannot connect to ‘memsql-01’:3307. Errno=113 (No route to host) Query: select * from segments Parameters: []

For some reason it still needs the master to run. Is there something we can do? At this point HA is enabled, but that doesn’t make any sense because it doesn’t work.

We have 2 servers: server 1 has a leaf and the master aggregator, and server 2 has another leaf and a child aggregator.

arnaud · June 29, 2023, 7:01pm

It seems like you are facing a connectivity issue. The error message you quoted suggests that your application is unable to reach the Master Aggregator node. It is crucial to ensure that all nodes are consistently reachable within a SingleStore cluster for HA and load-balancing to operate as expected. SingleStore’s High Availability (HA) is designed to tolerate the failure of individual nodes within the topology, but not the failure of the entire Master Aggregator.

The Master Aggregator is a critical component for maintaining and coordinating the cluster state. If you have stopped the Master Aggregator node, your connections will still be disrupted unless you take further steps to recover.

If there is a need to stop or shut down the Master Aggregator for maintenance or other reasons, SingleStore supports promoting a Child Aggregator to be the new Master (Setup for Replication Database Between Cluster). This can be done using the AGGREGATOR SET AS MASTER command. This will maintain availability of the cluster even if the original Master Aggregator is down.

If you experience more issues, I would recommend filing a support ticket.

Kostya · June 29, 2023, 7:01pm

Hello, can you try connecting using the node IP instead of the name, please?
“No route to host” means that there’s no network connectivity between the client and the host.
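
A quick way to test this outside the database is a plain TCP connection attempt to the node by name and by IP. The following is only a rough sketch: the host name and port 3307 are taken from the error message above, while the IP address is a placeholder (a documentation-range address) that would need to be replaced with the node's real IP. `check_tcp` is a hypothetical helper, not part of any SingleStore tooling.

```python
import errno
import socket


def check_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection; return (True, None) or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers DNS failures, timeouts, "connection refused" and
        # "no route to host" (EHOSTUNREACH, errno 113 on Linux).
        name = errno.errorcode.get(exc.errno, "unknown") if exc.errno else "unknown"
        return False, f"{type(exc).__name__} ({name}): {exc}"


if __name__ == "__main__":
    # "memsql-01" and 3307 come from the error message in this thread;
    # 192.0.2.10 is a placeholder and must be replaced with the real node IP.
    for host in ("memsql-01", "192.0.2.10"):
        ok, reason = check_tcp(host, 3307)
        print(f"{host}:3307 ->", "reachable" if ok else f"unreachable, {reason}")
```

If the name fails but the IP works, the problem is likely name resolution; if both fail with "no route to host", it points at routing or a firewall between the two machines, or at the target host simply being down.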

michael.arthur · July 6, 2023, 2:29pm

Assuming the outage of the master is unplanned but temporary, would PROMOTE AGGREGATOR … TO MASTER be more appropriate? For instance, in the connectivity example above, the root cause might be an incorrectly configured firewall, with the team waiting on the updated rule to propagate rather than redeploying the MAG.

In the case of PROMOTE AGG, will the backup MAG automatically be demoted back to CAG once the original MAG is healthy?

arnaud · July 14, 2023, 6:20pm

HA is at the leaf level (the aggregators do not fail over to each other). If the MA goes offline, you will need to connect to a CA, and to make sure the cluster has a consensus leader you would want to promote that CA to MA. It remains the MA from here on out unless you re-add the old MA and promote it manually.
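
On the client side, that could look roughly like the sketch below: try each aggregator in order, use the first one that accepts a connection, and keep the promotion as a separate, deliberate step. Everything here is an assumption for illustration: `connect` stands in for whatever DB-API driver you use (any callable taking a host and port and returning a connection), the endpoint list is hypothetical, and the only SQL used is the AGGREGATOR SET AS MASTER statement mentioned earlier in this thread. It is not an official failover procedure.

```python
# Statement named earlier in the thread; it is meant to be issued on the
# child aggregator that should become the new master.
PROMOTE_SQL = "AGGREGATOR SET AS MASTER"


def connect_first_available(endpoints, connect):
    """Return ((host, port), connection) for the first endpoint that accepts.

    `connect` is a caller-supplied callable (host, port) -> DB-API connection.
    """
    errors = []
    for host, port in endpoints:
        try:
            return (host, port), connect(host, port)
        except Exception as exc:  # drivers raise their own error types
            errors.append(f"{host}:{port}: {exc}")
    raise ConnectionError("no aggregator reachable: " + "; ".join(errors))


def promote_to_master(conn):
    """Run the promotion statement on an already open CA connection."""
    cur = conn.cursor()
    try:
        cur.execute(PROMOTE_SQL)
    finally:
        cur.close()
```

Promotion is kept out of the connection helper on purpose: a brief network blip should not silently change which node is the master, so `promote_to_master` is something an operator would call only after confirming the old MA is really gone.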