Hi, we’re using a SingleStore 7.3.3 installation on Kubernetes via the operator. It runs fine most of the time, but every so often we lose an environment. I had the chance to investigate one such situation today, and all I could find from the pod was that the readiness probe had failed (master node).
The node was up: I could exec bash into it and manually run the readiness probe script. I could also run the command that the script runs (memsqlctl query --sql “SHOW DATABASES EXTENDED”) and it returned information fine. While “trying stuff” I ran memsqlctl restart-node, which restarted the pod and brought things back up, only for it to go down again a bit later. This has kept happening and I’m really not sure what more information to try to get.
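For reference, this is roughly what I was doing (the namespace and pod names are placeholders for our environment):

```
# Get a shell on the master aggregator pod
kubectl -n <namespace> exec -it <master-pod> -- bash

# Inside the pod: the same query the readiness probe script runs returned fine
memsqlctl query --sql "SHOW DATABASES EXTENDED"

# The restart that temporarily brought things back up
memsqlctl restart-node
```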
Is there any way to get more information about why the readiness probe failed? All I can see from kubectl get events is that it failed.
Is there anywhere else I can look, either via bash on the node or via kubectl? I’m tempted to turn the probe off, as it doesn’t feel like the node is actually down.
Sorry you are hitting this issue. When you say that you are losing an environment, could you clarify what you mean? Are you unable to connect to the DDL service, or are pods actually getting deleted or crashing?
Readiness probe failures should not cause a pod to be restarted or deleted; they just make the pod unreachable through its service. Only a liveness probe failure can cause a pod restart, and the liveness probe only checks that the process is still running.
The first place I would look is the operator logs, to see where it is failing and what the operator is trying to do. From there I would check the MA database logs to see what is going on. It is also worth looking at the MA pod and StatefulSet status and events in case there is more information.
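For example (a rough sketch; the namespace and pod/StatefulSet names are placeholders for your environment):

```
# Operator logs: what the operator is doing and where it is failing
kubectl -n <namespace> logs <operator-pod>

# Stdout of the MA node's container
kubectl -n <namespace> logs <ma-pod>

# StatefulSet status and recent events for the MA
kubectl -n <namespace> describe statefulset <ma-statefulset>
```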
As for the lack of log messages from the readiness probe: if a command in the readiness probe script fails outright, there should be a clear error message, and if the commands complete but show the node is in a bad state, the script will log an error before exiting. I think the most likely case is that the memsqlctl query commands themselves are hanging. If those commands hang, the readiness probe fails when the probe times out, and the script itself won’t emit much of a message.
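One way to test that from inside the MA pod (a sketch; it assumes coreutils timeout is available in the container, and <timeoutSeconds> is whatever your probe is configured with):

```
# Time the same query the readiness probe script runs
time memsqlctl query --sql "SHOW DATABASES EXTENDED"

# Or enforce the probe's deadline directly; exit status 124 means the
# command ran longer than the probe would allow
timeout <timeoutSeconds> memsqlctl query --sql "SHOW DATABASES EXTENDED"; echo "exit=$?"
```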
To clarify too, which operator version are you running?
Sorry, I should have been clearer on that. By losing an environment I mean we cannot connect to the DDL service, which for us is effectively “down”, but the node itself is still up (I can exec bash on it and run things). I assume the liveness probe is therefore still succeeding.
As for logs: where can I find the operator logs and MA database logs? When I view the stdout of the pod it doesn’t have any noticeable error message. Will the readiness probe failure also be written to those logs?
It looks like that is a relatively old operator image. Looking at the code, operator versions before 2.0.1 have a very low default value for the readiness probe timeoutSeconds, at 1 second. This was subsequently bumped to a more reasonable default of 10 seconds.
If the readiness probe script is unable to complete before the timeout, the pod is treated as unready, and after enough failures beyond the probe’s failureThreshold, the MA pod will no longer be reachable through the service.
Are you setting that value explicitly or is it running with the default value?
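If you’re unsure, you can check the probe settings that are actually in effect on the pod (names are placeholders):

```
# Show the readiness probe configured on the MA pod, including
# timeoutSeconds and failureThreshold
kubectl -n <namespace> get pod <ma-pod> -o jsonpath='{.spec.containers[*].readinessProbe}'
```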
I would recommend either updating to a newer operator version, which has a higher default value, or explicitly setting the readiness probe timeout to a higher value in the spec. This is documented here: SingleStoreDB Cloud · SingleStore Documentation
Also, to answer your question: to get the operator or MA logs you can just do a kubectl logs <pod_name> for the operator pod and the MA pod. This is the stdout of the process in the container itself, not the stdout of the scripts run by the readiness probes. The stdout of the readiness probes ends up in the Kubernetes events associated with the pod; you’ll be able to see those with kubectl describe pod <pod_name>.
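Concretely, something along these lines (names are placeholders):

```
# Probe failures show up as "Unhealthy" events on the pod; the probe's
# output is included in the event message
kubectl -n <namespace> describe pod <ma-pod>

# Or filter the events down to just the probe failures
kubectl -n <namespace> get events --field-selector involvedObject.name=<ma-pod>,reason=Unhealthy
```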
Thanks for your reply. I discovered the 1-second readiness probe timeout during my investigation yesterday and have bumped it to 10 seconds. Luckily, it appears from the 1.2.4 release notes that this is the version where the ability to set it was exposed. We’ll look at moving to the latest version, but for now we need to control the change and just increase that value.
The stdout in the kubectl logs from the pods and the events in the describe output provide no further information, which supports the idea that this is a timeout. I’ll see how things go with the 10-second timeout.