I have two pipelines connected each to a kafka topic.
The two pipelines need to manipulate records of the same table and I got a ER_LOCK_WAIT_TIMEOUT. I guess one pipeline is locking the table when it is running and the other can not access it. What is the recommended implementation for this kind of problem ?
Hi Frederick, thanks for trying out Kafka pipelines!
could you share some more details about the workload? does the error happen on every batch, or intermittently? how often to the pipelines run? what is the size of the batch? are the two pipelines defined straight into table, or into stored procedure first?
It is happening intermittently and mainly when the batch size is big (1 000 000). When I reduce the batch size, the performance is degrading but the problem is less frequent.
The two pipelines are loading data into a dedicated store procedure.
Here is the process
I have two kafka topics containing de-normalized data:
first topic => assays (name, description, metadata), I am using a store procedure because I need to create tri-grams in another table for each assay name.
second topic => experiments (experiment_identifier, experiment_date, assay_name)
I would like to use a proper id in the experiments table and not repeating the assay name
Also because the two pipelines are running concurrently, I don’t know when the assay will be ready in the assay table.
My strategy is to create the record in the assay table from both pipelines.
If the experiment flow is first then the insert will generate an id and the record with only contain the assay name, then the second pipeline will update the record and will add the other fields.
If the assay flow is first then the pipeline will insert a full record in the assay table and the second pipeline will just pick the id.
The locking happens when the same record is inserted / updated in the same time.
Hi Frederick. I ran this question by some engineers, so let me give you some options to experiment with
If it’s possible to merge the Kafka topics into one, and having only one pipeline and stored procedure, that is most likely the most performant way of ingesting all the data and not have contending transactions.
In recent versions of MemSQL, you could assign pipelines to run in a specific resource pool, and the pool could enforce concurrency of only 1 query. SingleStoreDB Cloud · SingleStore Documentation If both pipelines are sharing the resource pool, it should have the effect of eliminating situations where one pipeline is blocking another. The downside is likely lower total throughput.
Another idea was to potentially create a third table of assays that have not arrived via first topic, but did arrive via second. In essence, you separate the tables into ones that are inserted by first pipe, and ones that are updated by the second. Obviously schema changes like that could potentially require more redesign.
Please let us know which option you decide to try and your results.