How to do upsert without causing too much memory spike?

bhaskar · August 7, 2024, 11:05am

Hi,

We have to do around 100k records upsert into a table. The table has around 200-300 columns. What we are currently doing is dividing that 100k records into 1k batch and doing the upserts. But this is causing a high memory spike. The table is columnar.

What is the best way to do heavy upsert into a columnar table?

hanson1 · August 13, 2024, 10:22pm

Batching is a good idea as I mentioned on

It’s possible even 1k batches are too big, though not likely. Consider reducing the batch size to 500 or 250, say.

I’m not sure what’s causing you to run out of memory. What error message are you getting? It might be that the in-memory rowstore segment is getting too big. Normally it would flush automatically though, so that’s probably not it. But you could try running OPTIMIZE TABLE FLUSH after every 10 batches or so. That would flush the contents of the in-memory rowstore segment to disk.

bhaskar · October 2, 2024, 8:14am

Thanks @hanson1

Suppose if I am running flush on on table and the query happens for the same table parallel then it won’t affect right?

also following are the high memory usage things:

memsql_status_alloc_query_ingest → 8.1 GB
memsql_status_alloc_query_execution → 5.5 GB
memsql_status_malloc_active_memory → 4.3 GB
memsql_status_alloc_unit_images → 2.8 GB
memsql_status_alloc_table_primary → 2.6 GB
memsql_status_alloc_thread_stacks → 1 GB
memsql_status_alloc_system_tasks → 613 MB
memsql_status_alloc_hash_buckets → 998 MB
memsql_status_alloc_skiplist_tower → 700 MB
memsql_status_alloc_variable → 700 MB
memsql_status_alloc_compiled_unit_sections → 700 MB

is there a documentation where I can check what all these parameters means?

hanson1 · October 7, 2024, 11:38pm

If you are running FLUSH on a table and you are querying the table at the same time, they will be able to run at the same time concurrently without blocking each other – FLUSH is an online operation. But of course they will be competing with each other for CPU time (sharing time).

However, there is a brief blocking period at the start of the optimization process. The OPTIMIZE TABLE command will wait until all DML queries that were already running on the table finish before it begins optimizing. This allows in-progress queries to complete and ensures consistency of results ( 1, 2 ).

The different allocators are defined here:

IslaBell · November 4, 2024, 3:18pm

Thank you for the link, you made my day.