Hi all,
I tested how the performance of a Kafka pipeline changes with the number of Kafka topic partitions.
Test data size: about 5.5 GB (30,000,000 rows)
Cluster: 1 MA + 4 LF (1 master aggregator, 4 leaf nodes) on a single host (one machine)
Database partitions: 32
Target: columnstore table (columnstore index, shard key = timestamp column)
The test results are below.
Kafka Topic Partitions CNT | MAX_PARTITIONS_PER_BATCH | pipeline_name | batch_id | batch_time | rows_streamed |
---|---|---|---|---|---|
1 | 30 | pp_kafka_load | 481200 | 419.967629 | 30,000,000 |
10 | 30 | pp_kafka_load2 | 501461 | 427.524445 | 29,959,137 |
10 | 30 | pp_kafka_load2 | 501905 | 0.633521 | 40,863 |
30 | 30 | pp_kafka_load3 | 491164 | 416.204006 | 29,891,425 |
30 | 30 | pp_kafka_load3 | 491876 | 1.716615 | 108,575 |
32 | 32 | pp_kafka_load4 | 498693 | 416.494033 | 30,000,000 |
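To make the comparison easier, here is a small sketch that turns the numbers above into rows per second for each pipeline. It assumes batch_time is in seconds and sums the two batches for pp_kafka_load2 and pp_kafka_load3; the function name `throughput_by_pipeline` is just something I made up for this post. By my calculation every run lands at roughly 70,000 to 72,000 rows per second.

```python
# (topic partitions, pipeline name, batch_time, rows_streamed), copied from the table above.
# Assumption: batch_time is measured in seconds.
batches = [
    (1, "pp_kafka_load", 419.967629, 30_000_000),
    (10, "pp_kafka_load2", 427.524445, 29_959_137),
    (10, "pp_kafka_load2", 0.633521, 40_863),
    (30, "pp_kafka_load3", 416.204006, 29_891_425),
    (30, "pp_kafka_load3", 1.716615, 108_575),
    (32, "pp_kafka_load4", 416.494033, 30_000_000),
]


def throughput_by_pipeline(rows):
    """Sum time and row counts per pipeline, then return rows per second."""
    totals = {}
    for partitions, name, seconds, streamed in rows:
        entry = totals.setdefault(name, [partitions, 0.0, 0])
        entry[1] += seconds
        entry[2] += streamed
    return {
        name: (partitions, streamed / seconds)
        for name, (partitions, seconds, streamed) in totals.items()
    }


for name, (partitions, rate) in throughput_by_pipeline(batches).items():
    print(f"{name}: {partitions} topic partitions, {rate:,.0f} rows/s")
```

This only measures end-to-end batch throughput; it does not show where the time is spent (extraction from Kafka versus writing to the columnstore table).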
Does the number of partitions in the Kafka topic have no effect on pipeline performance?
And why did pp_kafka_load2 and pp_kafka_load3 each run two batches?
I also tested a pipeline, pp_kafka_load5, under the same conditions as pp_kafka_load4 but loading into a table without a shard key, and confirmed that the shard key affects the pipeline's performance.
Are there any other tuning points for improving pipeline performance?
Thanks,
MK