sunil
August 18, 2020, 5:52pm
1
Hello,
We are running MemSQL 6.8 (128 GB free license) with a heterogeneous workload, and we see the leaves occasionally exceeding the provisioned memory limit, which leads to query errors. None of the tables are using a lot of memory, and profiling hasn’t revealed any bad actors.
memsql-report is showing the following errors:
FAIL Found 14999 'buffer manager memory allocation failure' errors in last 7 days for 172.40.21.220:3306 (C99E9EBDD2)
FAIL Found 14350 'buffer manager memory allocation failure' errors in last 7 days for 172.40.21.239:3306 (C99E9EBDD2)
FAIL Malloc_active_memory too high on node C99E9EBDD25CFC195BDDD65DB0D7551E7BA3CA0E (5.16 GB)
FAIL Malloc_active_memory too high on node 01D22BEB264B9C8991927CCFDD2027E51552F3B4 (5.29 GB)
From show status extended, the component below looks large on both leaves:
Alloc_durability_large 22101.626 MB
Please let me know how we can go about debugging this further
Thanks,
Sunil
adam
August 19, 2020, 6:50pm
2
Hi Sunil,
It's likely those errors are from queries that failed with out-of-memory errors.
Can you post “show status extended” from one of your leaves that hit that error?
Alloc_durability_large is part of the MemSQL 6.X durability code (the buffer used to commit transactions). You can decrease it by decreasing the transaction_buffer system variable. It's a startup-only variable, so it will need to be set in the memsql.cnf file on each node (tools or ops can do that for you; see the SingleStore Documentation). MemSQL 7.X no longer uses fixed-size buffers to commit transactions, so this memory use will drop when you upgrade as well.
-Adam
sunil
August 19, 2020, 7:53pm
3
Hello Adam,
Thanks a lot for the response.
Below is the show status extended output from the report collected at the time of the problem.
Please let me know if anything jumps out of this info.
Once again, thanks a lot for your time.
+-----------------------------------------------------------+--------------------------------------------------------------+
| Variable_name | Value |
+-----------------------------------------------------------+--------------------------------------------------------------+
| Aborted_clients | 9121 |
| Aborted_connects | 2 |
| Bytes_received | 2585487428551 |
| Bytes_sent | 3049540944929 |
| Connections | 12890 |
| Max_used_connections | 3445 |
| Queries | 1185509648 |
| Questions | 1185509648 |
| Threads_cached | 872 |
| Threads_connected | 2740 |
| Threads_created | 1814 |
| Threads_running | 361 |
| Threads_background | 345 |
| Threads_shutdown | 40302 |
| Threads_idle | 1798 |
| Ready_queue | 0 |
| Idle_queue | 0 |
| Context_switches | 573054295 |
| Context_switch_misses | 532845 |
| Workload_management_queued_queries | 0 |
| Workload_management_active_queries | 0 |
| Workload_management_active_threads | 0 |
| Workload_management_active_connections | 0 |
| Columnstore_ingest_management_queued_queries | 0 |
| Columnstore_ingest_management_active_queries | 0 |
| Columnstore_ingest_management_estimated_segments_to_flush | 0 |
| Columnstore_ingest_management_estimated_memory | 109.922 (+0.946) MB |
| Uptime | 948689 |
| Prepared_stmt_count | 0 |
| Auto_attach_remaining_seconds | 0 |
| Data_directory | /var/lib/memsql/8d1e9e0b-4bba-40fd-b1ea-8f9d8dd58f86/data |
| Plancache_directory | /var/lib/memsql/8d1e9e0b-4bba-40fd-b1ea-8f9d8dd58f86/plan... |
| Transaction_logs_directory | /var/lib/memsql/8d1e9e0b-4bba-40fd-b1ea-8f9d8dd58f86/data... |
| Segments_directory | /var/lib/memsql/8d1e9e0b-4bba-40fd-b1ea-8f9d8dd58f86/data... |
| Snapshots_directory | /var/lib/memsql/8d1e9e0b-4bba-40fd-b1ea-8f9d8dd58f86/data... |
| Threads_waiting_for_disk_space | 0 |
| License | BGYyOWIyOTI3OWQyZjQ3ZjNiNjdkYmViYjU4YzdhMDQ5AAAAAAAAAAAAA... |
| License_version | 4 |
| License_capacity | 131072 MB |
| License_expiration | 0 |
| Seconds_until_expiration | -1 |
| License_key | f29b29279d2f47f3b67dbebb58c7a049 |
| License_type | free |
| Maximum_cluster_capacity | 131072 MB |
| Query_compilations | 36802 |
| Query_compilation_failures | 0 |
| Inflight_async_compilations | 0 |
| GCed_versions_last_sweep | 290 |
| Average_garbage_collection_duration | 232 ms |
| Total_server_memory | 51659.9 (+5390.4) MB |
| Total_io_pool_memory | 0.1 MB |
| Free_io_pool_memory | 0.0 MB |
| Alloc_thread_stacks | 2159.000 MB |
| Malloc_active_memory | 5288.381 (+3.980) MB |
| Malloc_transaction_cached_memory | 1054.136 MB |
| Buffer_manager_memory | 6678.6 (+266.0) MB |
| Buffer_manager_cached_memory | 1.0 (-398.4) MB |
| Buffer_manager_unrecycled_memory | 6.9 (+4.4) MB |
| Alloc_skiplist_tower | 1168.000 (-0.625) MB |
| Alloc_variable | 1456.500 MB |
| Alloc_table_primary | 1234.875 (-0.750) MB |
| Alloc_deleted_version | 673.875 (+0.375) MB |
| Alloc_internal_key_node | 453.000 MB |
| Alloc_hash_buckets | 2227.446 MB |
| Alloc_table_metadata_cache | 11.500 MB |
| Alloc_unit_images | 1921.391 (+0.204) MB |
| Alloc_unit_ifn_thunks | 36.743 (+0.016) MB |
| Alloc_object_code_images | 609.118 (+0.083) MB |
| Alloc_compiled_unit_sections | 373.352 (+0.055) MB |
| Alloc_databases_list_entry | 13.125 MB |
| Alloc_plan_cache | 10.750 MB |
| Alloc_warnings | 213.125 (+0.375) MB |
| Alloc_replication_large | 3096.000 MB |
| Alloc_durability_large | 22101.626 MB |
| Alloc_skynet_replication | 0.375 MB |
| Alloc_sharding_partitions | 0.250 MB |
| Alloc_log_replay | 2049.953 (+1.609) MB |
| Alloc_mmap_memory | 7168.000 (+5120.000) MB |
| Alloc_mmap_file | 3072.000 MB |
| Alloc_client_connection | 180.000 (+10.000) MB |
| Alloc_protocol_packet | 342.375 (+0.125) MB |
| Alloc_large_incremental | 0.250 (+0.125) MB |
| Alloc_background_tasks | 913.250 (+650.625) MB |
| Alloc_table_memory | 7213.696 (-1.000) MB |
| Alloc_variable_bucket_16 | allocs:2473390 alloc_MB:37.7 buffer_MB:38.5 cached_buf... |
| Alloc_variable_bucket_24 | allocs:216550 alloc_MB:5.0 buffer_MB:5.5 cached_buffer... |
| Alloc_variable_bucket_32 | allocs:339019 alloc_MB:10.3 buffer_MB:10.9 cached_buff... |
| Alloc_variable_bucket_40 | allocs:870757 alloc_MB:33.2 buffer_MB:120.9 cached_buf... |
| Alloc_variable_bucket_48 | allocs:29019 alloc_MB:1.3 buffer_MB:1.6 cached_buffer_... |
| Alloc_variable_bucket_56 | allocs:24085 alloc_MB:1.3 buffer_MB:3.6 cached_buffer_... |
| Alloc_variable_bucket_64 | allocs:34319 alloc_MB:2.1 buffer_MB:2.8 cached_buffer_... |
| Alloc_variable_bucket_72 | allocs:16092 alloc_MB:1.1 buffer_MB:1.5 cached_buffer_... |
| Alloc_variable_bucket_80 | allocs:7449 alloc_MB:0.6 buffer_MB:0.9 cached_buffer_M... |
| Alloc_variable_bucket_88 | allocs:23751 alloc_MB:2.0 buffer_MB:2.6 cached_buffer_... |
| Alloc_variable_bucket_104 | allocs:134888 alloc_MB:13.4 buffer_MB:14.2 cached_buff... |
| Alloc_variable_bucket_128 | allocs:43119 alloc_MB:5.3 buffer_MB:5.5 cached_buffer_... |
| Alloc_variable_bucket_160 | allocs:1073049 alloc_MB:163.7 buffer_MB:275.8 cached_b... |
| Alloc_variable_bucket_200 | allocs:184550 alloc_MB:35.2 buffer_MB:36.1 cached_buff... |
| Alloc_variable_bucket_248 | allocs:347557 alloc_MB:82.2 buffer_MB:468.2 cached_buf... |
| Alloc_variable_bucket_312 | allocs:31341 alloc_MB:9.3 buffer_MB:23.5 cached_buffer... |
| Alloc_variable_bucket_384 | allocs:29237 alloc_MB:10.7 buffer_MB:12.6 cached_buffe... |
| Alloc_variable_bucket_480 | allocs:707 alloc_MB:0.3 buffer_MB:2.2 cached_buffer_MB... |
| Alloc_variable_bucket_600 | allocs:1188 alloc_MB:0.7 buffer_MB:5.0 cached_buffer_M... |
| Alloc_variable_bucket_752 | allocs:14572 alloc_MB:10.5 buffer_MB:12.5 cached_buffe... |
| Alloc_variable_bucket_936 | allocs:2766 alloc_MB:2.5 buffer_MB:5.6 cached_buffer_M... |
| Alloc_variable_bucket_1168 | allocs:1714 alloc_MB:1.9 buffer_MB:4.0 cached_buffer_M... |
| Alloc_variable_bucket_1480 | allocs:1129 alloc_MB:1.6 buffer_MB:4.1 cached_buffer_M... |
| Alloc_variable_bucket_1832 | allocs:701 alloc_MB:1.2 buffer_MB:3.5 cached_buffer_MB... |
| Alloc_variable_bucket_2288 | allocs:577 alloc_MB:1.3 buffer_MB:3.2 cached_buffer_MB... |
| Alloc_variable_bucket_2832 | allocs:424 alloc_MB:1.1 buffer_MB:4.8 cached_buffer_MB... |
| Alloc_variable_bucket_3528 | allocs:552 alloc_MB:1.9 buffer_MB:6.5 cached_buffer_MB... |
| Alloc_variable_bucket_4504 | allocs:893 alloc_MB:3.8 buffer_MB:7.6 cached_buffer_MB... |
| Alloc_variable_bucket_5680 | allocs:1002 alloc_MB:5.4 buffer_MB:6.8 cached_buffer_M... |
| Alloc_variable_bucket_6224 | allocs:262 alloc_MB:1.6 buffer_MB:2.6 cached_buffer_MB... |
| Alloc_variable_bucket_7264 | allocs:234 alloc_MB:1.6 buffer_MB:3.0 cached_buffer_MB... |
| Alloc_variable_bucket_9344 | allocs:126 alloc_MB:1.1 buffer_MB:1.9 cached_buffer_MB... |
| Alloc_variable_bucket_11896 | allocs:46 alloc_MB:0.5 buffer_MB:1.6 cached_buffer_MB:0.6 |
| Alloc_variable_bucket_14544 | allocs:41 alloc_MB:0.6 buffer_MB:1.6 cached_buffer_MB:0.4 |
| Alloc_variable_bucket_18696 | allocs:40 alloc_MB:0.7 buffer_MB:2.4 cached_buffer_MB:0.9 |
| Alloc_variable_bucket_21816 | allocs:21 alloc_MB:0.4 buffer_MB:2.8 cached_buffer_MB:1.6 |
| Alloc_variable_bucket_26184 | allocs:31 alloc_MB:0.8 buffer_MB:2.2 cached_buffer_MB:0.9 |
| Alloc_variable_bucket_32728 | allocs:36 alloc_MB:1.1 buffer_MB:2.8 cached_buffer_MB:1.2 |
| Alloc_variable_bucket_43648 | allocs:27 alloc_MB:1.1 buffer_MB:3.4 cached_buffer_MB:1.8 |
| Alloc_variable_bucket_65472 | allocs:3221 alloc_MB:201.1 buffer_MB:241.4 cached_buff... |
| Alloc_variable_bucket_130960 | allocs:787 alloc_MB:98.3 buffer_MB:100.2 cached_buffer... |
| Alloc_variable_cached_buffers | 35.2 (+0.6) MB |
| Alloc_variable_allocated | 755.6 MB |
| Successful_read_queries | 1925989179 |
| Successful_write_queries | 352672744 |
| Failed_read_queries | 3420 |
| Failed_write_queries | 3558972 |
| Rows_returned_by_reads | 1213155914 |
| Rows_affected_by_writes | 217704556 |
| Execution_time_of_reads | 266196603 ms |
| Execution_time_of_write | 262665970 ms |
| Transaction_buffer_wait_time | 0 ms |
| Transaction_log_flush_wait_time | 0 ms |
| Row_lock_wait_time | 718482 ms |
| Ssl_accept_renegotiates | 0 |
| Ssl_accepts | 0 |
| Ssl_callback_cache_hits | 0 |
| Ssl_client_connects | 0 |
| Ssl_connect_renegotiates | 0 |
| Ssl_ctx_verify_depth | 18446744073709551615 |
| Ssl_ctx_verify_mode | 0 |
| Ssl_default_timeout | 0 |
| Ssl_finished_accepts | 0 |
| Ssl_finished_connects | 0 |
| Ssl_session_cache_hits | 0 |
| Ssl_session_cache_misses | 0 |
| Ssl_session_cache_overflows | 0 |
| Ssl_session_cache_size | 20480 |
| Ssl_session_cache_timeouts | 0 |
| Ssl_sessions_reused | 0 |
| Ssl_used_session_cache_entries | 0 |
| Ssl_verify_depth | 0 |
| Ssl_verify_mode | 0 |
| Ssl_cipher | |
| Ssl_cipher_list | |
| Ssl_version | |
| Ssl_session_cache_mode | SERVER |
+-----------------------------------------------------------+--------------------------------------------------------------+
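To see at a glance which components dominate, the rows above can be parsed and ranked by size. The following is only a minimal sketch: the helper names (`parse_mb_rows`, `rank_components`) are made up for illustration, the sample text holds just a handful of rows copied from the output above, and the reported components may overlap, so the shares are not expected to add up to 100%.

```python
import re

# A few rows copied from the output above; paste the full table in practice.
STATUS_TEXT = """
| Total_server_memory | 51659.9 (+5390.4) MB |
| Alloc_thread_stacks | 2159.000 MB |
| Malloc_active_memory | 5288.381 (+3.980) MB |
| Buffer_manager_memory | 6678.6 (+266.0) MB |
| Alloc_replication_large | 3096.000 MB |
| Alloc_durability_large | 22101.626 MB |
| Alloc_table_memory | 7213.696 (-1.000) MB |
"""

# Matches rows like "| Name | 123.4 (+5.6) MB |"; the delta in parentheses is optional.
ROW_RE = re.compile(
    r"^\|\s*(\w+)\s*\|\s*([\d.]+)\s*(?:\([+-][\d.]+\)\s*)?MB\s*\|\s*$"
)


def parse_mb_rows(text):
    """Return {variable_name: megabytes} for every row whose value is in MB."""
    sizes = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            sizes[match.group(1)] = float(match.group(2))
    return sizes


def rank_components(sizes, total_key="Total_server_memory"):
    """Sort components by size and attach each one's share of the total."""
    total = sizes[total_key]
    others = {name: mb for name, mb in sizes.items() if name != total_key}
    ranked = sorted(others.items(), key=lambda item: item[1], reverse=True)
    return [(name, mb, 100.0 * mb / total) for name, mb in ranked]


if __name__ == "__main__":
    for name, mb, pct in rank_components(parse_mb_rows(STATUS_TEXT)):
        print(f"{name:<28} {mb:>10.1f} MB  {pct:5.1f}%")
```

With the full table pasted in, rows that are not memory components but are reported in MB (for example License_capacity) would also match, so filter by name prefix if that matters.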
adam
August 21, 2020, 3:56pm
4
Hi Sunil,
I would definitely lower transaction_buffer as mentioned above. I would lower it to 8 MB or 16 MB.
memsql-admin update-config --all --key "transaction_buffer" --value "16m"
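For a rough idea of how much memory this might free, here is a back-of-the-envelope sketch. It assumes Alloc_durability_large scales linearly with transaction_buffer, which is an assumption and not something confirmed in this thread, and the current buffer size in it is a hypothetical placeholder to be replaced with the value actually configured on the cluster.

```python
def estimate_durability_alloc_mb(observed_mb, current_buffer_mb, new_buffer_mb):
    """Scale the observed allocation linearly with the buffer size.

    Linear scaling is an ASSUMPTION made for this estimate, not documented behavior.
    """
    if current_buffer_mb <= 0 or new_buffer_mb <= 0:
        raise ValueError("buffer sizes must be positive")
    return observed_mb * new_buffer_mb / current_buffer_mb


if __name__ == "__main__":
    OBSERVED_MB = 22101.626   # Alloc_durability_large from the status output above
    CURRENT_BUFFER_MB = 128   # HYPOTHETICAL: use your actual transaction_buffer value
    for new_buffer_mb in (8, 16):
        estimate = estimate_durability_alloc_mb(
            OBSERVED_MB, CURRENT_BUFFER_MB, new_buffer_mb
        )
        saved = OBSERVED_MB - estimate
        print(
            f"transaction_buffer={new_buffer_mb} MB: "
            f"~{estimate:.0f} MB used, ~{saved:.0f} MB freed"
        )
```

Treat the result as an order-of-magnitude guide only, and check Alloc_durability_large again after the change.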
I don’t see anything else you could tune to lower memory use right now.
Does your cluster have a lot of tables created? There is a small memory overhead per table (a few MB), but it can add up when you have 1000s of tables.
-Adam
sunil
August 21, 2020, 5:27pm
5
Hi Adam,
Thanks a lot for the tip.
Our tables run into 100s, definitely not 1000s.
Would there be any negative impact from lowering the transaction buffer, especially on MemSQL responsiveness?
Thanks,
Sunil
adam
August 21, 2020, 7:30pm
6
It may impact the throughput of bursty row store write workloads a bit (workloads that write a few rows per transaction, but in aggregate run 100ks to millions of those a second). It's hard to say how much.
I forgot to mention that the cluster needs to restart for the change to take effect.
Another, bigger hammer is to upgrade to MemSQL 7.X, which doesn’t have a static transaction buffer; it's dynamically sized.
-Adam
sunil
August 21, 2020, 8:04pm
7
Sure Adam. Understood.
Thanks a lot for detailing all the options.