Disk Full on All Leaf Nodes

Hi,
We have a cluster consisting of two Aggregators (Master and Child) and six leaf nodes. All of the leaf nodes are now offline because their disks are 100% full. Is there anything we can do to clean up the disks, even if it means deleting the databases or the data directory on every leaf node?

Our goal is to free up disk space on all the leaf nodes so the cluster can be used again. Can anyone help us?



Hi Novian,

If you’re a current customer, you should open a support ticket with either a cluster report (if your /tmp mounts have enough space to collect one; if you need to use another mount such as /run from your screenshot, you can point the sdb-report collect command at it with the --temp-dir flag) or with recent logs plus du or recursive ls output of your leaf nodes’ data directories, along with any additional context you think might help (a large data load, other applications running on the cluster’s hosts, etc.). A rough sketch of that collection is below.
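For illustration only, something like the following could gather that information; the /var/lib/memsql path is just the typical default data location and the /run temp dir is taken from your screenshot, so substitute your actual layout:

```bash
# Collect a cluster report, pointing temporary files at a mount that still has space.
sdb-report collect --temp-dir /run

# If the report can't be collected, capture per-directory disk usage instead.
# /var/lib/memsql is only the typical default install path; adjust as needed.
du -sh /var/lib/memsql/*/data/*

# A recursive listing also works and shows individual file sizes.
ls -lR /var/lib/memsql/*/data > leaf-datadir-listing.txt
```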

The support team has put together a fairly exhaustive article on SingleStore disk utilization here:

The short story, though, is that you typically don’t want to touch anything in the data directory unless instructed to by SingleStore support, or unless it’s a core file from a crash that you aren’t sending in with a bug report. You can, however, clear the plancache directory to reclaim space and, to a lesser extent, the tracelogs directory (though that shouldn’t take up nearly as much). Only do this while the node processes are stopped, which, judging from your screenshots, isn’t currently the case. You can stop the cluster with sdb-admin stop-node --all or by sshing to each host and killing all memsqld processes. A sketch of that sequence follows.
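As a minimal sketch, assuming the default /var/lib/memsql layout (the node directory names will differ on your hosts, so treat the paths as placeholders):

```bash
# Stop every node managed by the toolbox so nothing is writing to the data directories.
sdb-admin stop-node --all

# With the nodes down, the compiled plan cache can be removed safely;
# it is rebuilt automatically as queries run again.
rm -rf /var/lib/memsql/<node-dir>/plancache/*

# Trace logs can also be trimmed, though they usually reclaim less space.
rm -f /var/lib/memsql/<node-dir>/tracelogs/*

# Bring the nodes back up once enough space has been freed.
sdb-admin start-node --all
```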

For the future, the minimal_disk_space variable will stop SingleStore from writing to disk once free space on the data mount drops below a configured threshold. On your version the default should be 100 MB (I believe it was raised in later versions); you can set it lower while you free up space and then set it back afterwards. Alternatively, you may want to consider using fallocate to create a large (say, 25 GB) throwaway file on the mount your SingleStore nodes run from, to act as a safety valve for scenarios like this. A hedged sketch of both ideas is below.
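Purely as an illustration (the host, credentials, and file path are placeholders, and depending on your engine version the variable may need to be set on each node rather than once on the master aggregator):

```bash
# Temporarily lower the threshold (value is in megabytes) while cleaning up,
# then restore it; connect with your usual MySQL-compatible client.
mysql -h 127.0.0.1 -P 3306 -u root -p -e "SET GLOBAL minimal_disk_space = 50;"
# ... free up space ...
mysql -h 127.0.0.1 -P 3306 -u root -p -e "SET GLOBAL minimal_disk_space = 100;"

# Pre-allocate a ~25 GB throwaway file on the data mount as a safety valve;
# deleting it later instantly frees headroom if the disk fills again.
fallocate -l 25G /var/lib/memsql/emergency-headroom.img
# In an emergency: rm /var/lib/memsql/emergency-headroom.img
```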
