I have a cluster running on AKS using the memsql-operator. Frequently, my connections to the -ddl LoadBalancer endpoint hang; the only fix I’ve found is to delete the LoadBalancer service and let the operator recreate it. Any idea what’s up?
Here’s an example:
azureuser@rowagnss03:~$ kubectl get services -nmemsql
NAME                     TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
svc-memsql-cluster       ClusterIP      None           <none>         3306/TCP         168m
svc-memsql-cluster-ddl   LoadBalancer   10.0.116.9     20.83.142.39   3306:31175/TCP   110m
svc-memsql-cluster-dml   LoadBalancer   10.0.171.170   20.84.11.38    3306:30645/TCP   167m
azureuser@rowagnss03:~$ kubectl describe service svc-memsql-cluster-ddl -nmemsql
Name:                     svc-memsql-cluster-ddl
Namespace:                memsql
Labels:                   app.kubernetes.io/component=master
                          app.kubernetes.io/instance=memsql-cluster
                          app.kubernetes.io/name=memsql-cluster
Annotations:              service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: 4000
Selector:                 app.kubernetes.io/instance=memsql-cluster,app.kubernetes.io/name=memsql-cluster,statefulset.kubernetes.io/pod-name=node-memsql-cluster-master-0
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.0.116.9
IPs:                      10.0.116.9
LoadBalancer Ingress:     20.83.142.39
Port:                     memsql  3306/TCP
TargetPort:               3306/TCP
NodePort:                 memsql  31175/TCP
Endpoints:                172.19.1.63:3306
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
azureuser@rowagnss03:~$ mysql -uadmin -pxxx -h20.83.142.39
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 2003 (HY000): Can't connect to MySQL server on '20.83.142.39:3306' (110)
azureuser@rowagnss03:~$ kubectl delete service svc-memsql-cluster-ddl -nmemsql
service "svc-memsql-cluster-ddl" deleted
azureuser@rowagnss03:~$ kubectl get services -nmemsql
NAME                     TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)          AGE
svc-memsql-cluster       ClusterIP      None           <none>           3306/TCP         179m
svc-memsql-cluster-ddl   LoadBalancer   10.0.209.198   52.226.255.180   3306:30074/TCP   102s
svc-memsql-cluster-dml   LoadBalancer   10.0.171.170   20.84.11.38      3306:30645/TCP   177m
azureuser@rowagnss03:~$ mysql -uadmin -pxxx -h52.226.255.180
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11841
Server version: 5.7.32 MemSQL source distribution (compatible; MySQL Enterprise & MySQL Commercial)
Copyright (c) 2000, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
Have you looked at any of the Azure or Kubernetes logs to view the health of the AKS network?
The S2 Operator and DB resources run within Kubernetes, and it is the job of the Kubernetes scheduler and the Kubernetes cluster itself to maintain the health of all of the resources and components that are spun up, including the DDL and DML services. Other than defining the service spec, there isn’t anything additional the Operator can directly do with the service resources; all of that work is handled by the Kubernetes scheduler and the underlying supporting infrastructure.
If the problems are not directly related to the S2 Operator or the S2 pods (you can check this, for example, by looking at the S2 tracelogs in the running pods), then to find the root cause you will have to look at the Azure and AKS logging and monitoring.
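As a quick first pass, something like the following can surface scheduler, node, or networking problems (the pod name, namespace, and tracelog path are taken from your output; adjust to your environment):

# Recent Kubernetes events in the memsql namespace, newest last
kubectl get events -n memsql --sort-by=.lastTimestamp

# Node health: watch for NotReady nodes or network-related conditions
kubectl get nodes -o wide
kubectl describe node <node-name>

# S2 tracelogs inside the master aggregator pod
kubectl exec -n memsql node-memsql-cluster-master-0 -- \
  tail -n 100 /var/lib/memsql/instance/tracelogs/memsql.log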
I compared your describe output with the ddl service output from our own clusters (one ddl endpoint has been running in AKS for 574d, another for 216d) and I see no differences of note other than the name.
One item to consider: is your AKS networking configured for kubenet or the Azure CNI?
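If you are not sure, the network plugin the cluster was created with can be read back via the Azure CLI (resource group and cluster name below are placeholders):

# Prints "azure" for Azure CNI or "kubenet" for kubenet
az aks show --resource-group <resource-group> --name <aks-cluster> \
  --query networkProfile.networkPlugin -o tsv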
Hey cynn - I am using the Azure CNI. The problem is currently occurring (i.e., telnet to the -ddl endpoint on port 3306 hangs), but the operator log is clean:
azureuser@rowagnss03:~$ kubectl logs memsql-operator-69f49b796d-zns45 -nmemsql | tail -n 14
2022/04/19 15:37:21 errors.go:64 {controller.memsql} Reconciler success will retry after: "5m0s"
2022/04/19 15:42:21 controller.go:183 {controller.memsql} Reconciling MemSQL Cluster. Request.Namespace: "memsql" Request.Name: "memsql-cluster"
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 clustering_master.go:23 {memsql} Ensure Master Aggregator is setup
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 clustering.go:91 {memsql} Gathering StatefulSet Info
2022/04/19 15:42:21 clustering.go:120 {memsql} Ensuring all node-memsql-cluster-leaf-ag1 are added to the MA
2022/04/19 15:42:21 clustering.go:120 {memsql} Ensuring all node-memsql-cluster-leaf-ag2 are added to the MA
2022/04/19 15:42:21 clustering.go:120 {memsql} Ensuring all node-memsql-cluster-aggregator are added to the MA
2022/04/19 15:42:21 clustering.go:166 {memsql} Ensuring all Leaves are updated to the latest statefulset changes
2022/04/19 15:42:21 clustering.go:819 {memsql} In paired mode, will do rebalance if leaf node removed
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 connection.go:38 {memsql} Connect to the Master Aggregator
2022/04/19 15:42:21 errors.go:64 {controller.memsql} Reconciler success will retry after: "5m0s"
But the tail of /var/lib/memsql/instance/tracelogs/memsql.log on the master aggregator contains a bunch of:
4782400813 2022-04-19 15:56:20.468 INFO: AZURE: HTTP status code 201 won't be retried.
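One way to narrow down whether the hang is at the Azure load balancer or behind it would be to probe the backend NodePort directly from a VM in the same VNet and compare that with the external endpoint (the IP and NodePort placeholders below come from kubectl get services -nmemsql):

# From a VM inside the VNet: hit the NodePort on an AKS node, bypassing the Azure LB
nc -vz -w 5 <aks-node-ip> <ddl-nodeport>

# From outside: hit the -ddl LoadBalancer external IP on 3306
nc -vz -w 5 <ddl-external-ip> 3306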