Very often, when we run heavy queries, the database (and the server) simply crashes. I have to restart the server, and after it recovers, everything works again.
I set up a resource pool for all users with a maximum of 70% of memory, a timeout of 600 seconds, a concurrency limit of 5, and a queue depth of 15 queries, but it is still happening.
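The pool definition looks roughly like this (clause names written from memory, so the exact syntax may differ slightly by version; the pool name here is just an example):

CREATE RESOURCE POOL heavy_queries WITH
    MEMORY_PERCENTAGE = 70,   -- cap pool memory at 70% of the node's limit
    QUERY_TIMEOUT = 600,      -- kill queries running longer than 600 seconds
    MAX_CONCURRENCY = 5,      -- at most 5 queries running at once
    MAX_QUEUE_DEPTH = 15;     -- up to 15 queries waiting in the queue

-- users/sessions are then pointed at this pool
-- (e.g. via the resource_pool session variable or per-user assignment)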
These are the last logs that I see in the master:
FROM view_query’ submitted 419 milliseconds ago, queued for 16 milliseconds, compiled asynchronously in 403 milliseconds
6754790299059 2020-12-10 22:43:45.907 WARN: socket (153) ETIMEDOUT in recv
6754791258972 2020-12-10 22:43:46.867 INFO: Background Statistics Thread: Writing stats
6754792347109 2020-12-10 22:43:47.955 WARN: socket (145) ETIMEDOUT in recv
6754798491146 2020-12-10 22:43:54.099 WARN: socket (149) ETIMEDOUT in recv
6754806687082 2020-12-10 22:44:02.295 WARN: socket (152) ETIMEDOUT in recv
6754818971077 2020-12-10 22:44:14.579 WARN: socket (155) ETIMEDOUT in recv
6754921371092 2020-12-10 22:45:56.979 WARN: socket (108) ETIMEDOUT in recv
6754935707103 2020-12-10 22:46:11.315 WARN: socket (101) ETIMEDOUT in recv
I don’t have an answer for you off the top of my head. If you’re a paying customer, opening a support case might be in order. I’ll ask around to see if someone who knows more can help.
We’re up to 7.1.13 now, so you might try upgrading. But I don’t have specific knowledge of any bug fixes that would impact this.
If the server crashed, you should see crash-reporting output in the tracelog (it will dump a call stack for the crash). I don’t see that in the snippet you pasted; if you don’t see that output, it likely wasn’t a crash. The most common reason for the process to die without crashing is the Linux OOM killer terminating it because memory limits aren’t configured properly (SingleStoreDB Cloud · SingleStore Documentation).
Thank you, I think I got it. I checked the kernel logs and you’re right… it’s an OOM issue.
I think I’m going to reduce the maximum_memory global variable. Currently it’s at 90% of the server’s RAM, but I will lower it a bit further (see the sketch after the kernel log below).
Dec 10 21:46:56 memsql kernel: [9248475.324923] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/memsql.service,task=memsqld,pid=1359,uid=112
Dec 10 21:46:56 memsql kernel: [9248475.326620] Out of memory: Killed process 1359 (memsqld) total-vm:30179084kB, anon-rss:25677544kB, file-rss:0kB, shmem-rss:0kB
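Something like this is what I have in mind for the change. I believe maximum_memory can be adjusted at runtime with SET GLOBAL (otherwise it goes in the node config); the value is in MB, and 24000 below is just an illustrative number for this box, not a recommendation:

-- check the current cap (value is in MB)
SHOW VARIABLES LIKE 'maximum_memory';

-- lower the cap from ~90% of RAM to something with more headroom for the OS
-- (24000 MB is only an example value)
SET GLOBAL maximum_memory = 24000;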
Another option is to ensure you have enough swap space set up. Linux is much more trigger-happy with OOM kills if there is not enough swap (we recommend 10-20% of physical memory).