Frequent LoadBalancer failures

I have a cluster running on AKS using the memsql-operator. Frequently, my connections to the -ddl LoadBalancer endpoint hang; the only solution I’ve found is to delete the LB and let the operator recreate it. Any idea what’s up?

Here’s an example:

azureuser@rowagnss03:~$ kubectl get services -nmemsql
NAME                     TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
svc-memsql-cluster       ClusterIP      None           <none>         3306/TCP         168m
svc-memsql-cluster-ddl   LoadBalancer   10.0.116.9     20.83.142.39   3306:31175/TCP   110m
svc-memsql-cluster-dml   LoadBalancer   10.0.171.170   20.84.11.38    3306:30645/TCP   167m
azureuser@rowagnss03:~$ kubectl describe service svc-memsql-cluster-ddl -nmemsql
Name:                     svc-memsql-cluster-ddl
Namespace:                memsql
Labels:                   app.kubernetes.io/component=master
                          app.kubernetes.io/instance=memsql-cluster
                          app.kubernetes.io/name=memsql-cluster
Annotations:              service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: 4000
Selector:                 app.kubernetes.io/instance=memsql-cluster,app.kubernetes.io/name=memsql-cluster,statefulset.kubernetes.io/pod-name=node-memsql-cluster-master-0
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.0.116.9
IPs:                      10.0.116.9
LoadBalancer Ingress:     20.83.142.39
Port:                     memsql  3306/TCP
TargetPort:               3306/TCP
NodePort:                 memsql  31175/TCP
Endpoints:                172.19.1.63:3306
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
azureuser@rowagnss03:~$ mysql -uadmin -pxxx -h20.83.142.39
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 2003 (HY000): Can't connect to MySQL server on '20.83.142.39:3306' (110)
azureuser@rowagnss03:~$ kubectl delete service svc-memsql-cluster-ddl -nmemsql
service "svc-memsql-cluster-ddl" deleted
azureuser@rowagnss03:~$ kubectl get services -nmemsql
NAME                     TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)          AGE
svc-memsql-cluster       ClusterIP      None           <none>           3306/TCP         179m
svc-memsql-cluster-ddl   LoadBalancer   10.0.209.198   52.226.255.180   3306:30074/TCP   102s
svc-memsql-cluster-dml   LoadBalancer   10.0.171.170   20.84.11.38      3306:30645/TCP   177m
azureuser@rowagnss03:~$ mysql -uadmin -pxxx -h52.226.255.180
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 11841
Server version: 5.7.32 MemSQL source distribution (compatible; MySQL Enterprise & MySQL Commercial)

Copyright (c) 2000, 2022, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
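
A sanity check that can help narrow down whether the hang is in the Azure load balancer or in the service backend (a sketch; <node-ip> is a placeholder for any AKS node address, while 31175 and 172.19.1.63 are the NodePort and pod endpoint from the describe output above):

# Bypass the Azure LB: test the NodePort path from a host that can reach the node IPs
telnet <node-ip> 31175

# Test the pod endpoint directly from inside the cluster
kubectl run nettest -nmemsql --rm -it --image=busybox --restart=Never -- telnet 172.19.1.63 3306

If both of those connect while the LB address hangs, the problem is on the Azure LB side rather than in the service backend.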

Have you looked at any of the Azure or Kubernetes logs to check the health of the AKS network?

The S2 Operator and DB resources run within Kubernetes; it is the function of the Kubernetes scheduler and cluster to maintain the health of all of the resources and components that are spun up, including the DDL and DML services. Other than defining the service spec, there isn’t anything additional the S2 Operator can do directly with the service resources; all of that work is handled by the Kubernetes scheduler/fabric and the underlying supporting infrastructure.

If the problem is not directly related to the S2 Operator or the S2 pods (you can check this by, for example, inspecting the S2 tracelogs inside the running pods), then you will need to look at the Azure and AKS logging and monitoring to find the root cause.
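
As a starting point, something like the following can surface both the Kubernetes-side and Azure-side signals (a sketch; <node-resource-group> is a placeholder, and AKS typically names its managed load balancer "kubernetes" in the node resource group):

# Kubernetes-side signals for the service and its backend
kubectl get events -nmemsql --sort-by=.lastTimestamp
kubectl get endpoints svc-memsql-cluster-ddl -nmemsql

# Azure-side: inspect the health probes on the load balancer AKS manages
az network lb probe list -g <node-resource-group> --lb-name kubernetes -o table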

I compared your describe output against the ddl services on two of our AKS clusters, one whose endpoint has been running for 574 days and another for 216 days, and I see no differences of note other than the name.

One item to consider: is your AKS networking configured for kubenet or for the Azure CNI?

For Azure, we only recommend using the Azure CNI.
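
You can confirm which plugin the cluster was provisioned with (placeholders for the resource group and cluster name):

az aks show -g <resource-group> -n <cluster-name> --query networkProfile.networkPlugin -o tsv
# prints "azure" for the Azure CNI or "kubenet" for kubenet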

Hey cynn, I am using the Azure CNI. The problem is occurring right now (i.e., telnet to the -ddl endpoint on port 3306 hangs), but the operator log is clean:

azureuser@rowagnss03:~$ kubectl logs memsql-operator-69f49b796d-zns45 -nmemsql | tail -n 14
2022/04/19 15:37:21 errors.go:64 {controller.memsql} Reconciler success will retry after: "5m0s"
2022/04/19 15:42:21 controller.go:183 {controller.memsql} Reconciling MemSQL Cluster. Request.Namespace: "memsql" Request.Name: "memsql-cluster"
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 clustering_master.go:23 {memsql} Ensure Master Aggregator is setup
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 clustering.go:91 {memsql} Gathering StatefulSet Info
2022/04/19 15:42:21 clustering.go:120 {memsql} Ensuring all node-memsql-cluster-leaf-ag1 are added to the MA
2022/04/19 15:42:21 clustering.go:120 {memsql} Ensuring all node-memsql-cluster-leaf-ag2 are added to the MA
2022/04/19 15:42:21 clustering.go:120 {memsql} Ensuring all node-memsql-cluster-aggregator are added to the MA
2022/04/19 15:42:21 clustering.go:166 {memsql} Ensuring all Leaves are updated to the latest statefulset changes
2022/04/19 15:42:21 clustering.go:819 {memsql} In paired mode, will do rebalance if leaf node removed
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 errors.go:64 {controller.memsql} Reconciler success will retry after: "5m0s"

And the service thinks it’s happy:
azureuser@rowagnss03:~$ kubectl describe service svc-memsql-cluster-ddl -nmemsql
Name:                     svc-memsql-cluster-ddl
Namespace:                memsql
Labels:                   app.kubernetes.io/component=master
                          app.kubernetes.io/instance=memsql-cluster
                          app.kubernetes.io/name=memsql-cluster
Annotations:              service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: 4000
Selector:                 app.kubernetes.io/instance=memsql-cluster,app.kubernetes.io/name=memsql-cluster,statefulset.kubernetes.io/pod-name=node-memsql-cluster-master-0
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.0.181.2
IPs:                      10.0.181.2
LoadBalancer Ingress:     20.84.28.215
Port:                     memsql  3306/TCP
TargetPort:               3306/TCP
NodePort:                 memsql  32043/TCP
Endpoints:                172.19.0.252:3306
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

But the tail of /var/lib/memsql/instance/tracelogs/memsql.log on the master aggregator contains a bunch of lines like:
4782400813 2022-04-19 15:56:20.468 INFO: AZURE: HTTP status code 201 won't be retried.

Does that mean anything to you?

Hi rowagn, can you capture the log output from the Master Aggregator pod and DM me your memsql CR for review?

Sure thing. I’ll DM you the following:

azureuser@rowagnss03:~$ kubectl logs node-memsql-cluster-master-0 -c node -nmemsql > master_logs.txt
azureuser@rowagnss03:~$ kubectl get MemsqlCluster -nmemsql -o=json > MemsqlCluster.json