Kafka Pipeline Performance Test

minkyung.kang · September 13, 2021, 7:30am

Hi all,

I tested how Kafka pipeline performance changes with the number of Kafka topic partitions.

Test data size: about 5.5 GB (30,000,000 rows)
Cluster: 1 MA + 4 LF on one host (using one device)
Database partitions: 32
Data is loaded into a columnstore table (columnstore index, shard key = timestamp column).

Below is the test result.

Kafka topic partition count	MAX_PARTITIONS_PER_BATCH	pipeline_name	batch_id	batch_time	rows_streamed
1	30	pp_kafka_load	481200	419.967629	30,000,000
10	30	pp_kafka_load2	501461	427.524445	29,959,137
		pp_kafka_load2	501905	0.633521	40,863
30	30	pp_kafka_load3	491164	416.204006	29,891,425
		pp_kafka_load3	491876	1.716615	108,575
32	32	pp_kafka_load4	498693	416.494033	30,000,000
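
To compare the runs on equal terms, the batches of each pipeline can be summed and converted to throughput. The following is a minimal sketch in Python, assuming batch_time is reported in seconds; the `summarize` helper and the `batches` list are illustrative names, and the numbers are copied from the table above.

```python
# Batch results copied from the table above:
# (kafka_partitions, pipeline_name, batch_time, rows_streamed)
# Assumption: batch_time is in seconds.
batches = [
    (1, "pp_kafka_load", 419.967629, 30_000_000),
    (10, "pp_kafka_load2", 427.524445, 29_959_137),
    (10, "pp_kafka_load2", 0.633521, 40_863),
    (30, "pp_kafka_load3", 416.204006, 29_891_425),
    (30, "pp_kafka_load3", 1.716615, 108_575),
    (32, "pp_kafka_load4", 416.494033, 30_000_000),
]


def summarize(rows):
    """Sum batch time and row counts per pipeline, then derive rows per second."""
    totals = {}
    for partitions, name, seconds, streamed in rows:
        entry = totals.setdefault(
            name,
            {"partitions": partitions, "seconds": 0.0, "rows": 0, "batches": 0},
        )
        entry["seconds"] += seconds
        entry["rows"] += streamed
        entry["batches"] += 1
    for entry in totals.values():
        entry["rows_per_sec"] = entry["rows"] / entry["seconds"]
    return totals


summary = summarize(batches)

if __name__ == "__main__":
    for name, entry in summary.items():
        print(
            f"{name}: kafka_partitions={entry['partitions']} "
            f"batches={entry['batches']} rows={entry['rows']} "
            f"total_time={entry['seconds']:.2f}s "
            f"throughput={entry['rows_per_sec']:.0f} rows/s"
        )
```

Summed this way, every pipeline loads the full 30,000,000 rows, and the throughput of the four runs differs by only a few percent.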

Does the number of partitions in the Kafka topic have no effect on pipeline performance?
And why did pp_kafka_load2 and pp_kafka_load3 each run two batches?

I also tested pipeline pp_kafka_load5 under the same conditions as pp_kafka_load4 but loading into a table without a shard key, and confirmed that the shard key affects the pipeline’s performance.

Are there any other tuning points that could improve the pipeline’s performance?

Thanks,
MK

jhuang · September 16, 2021, 6:43pm

How many SingleStore partitions do you have? In general, it’s best to have as many Kafka partitions as SingleStore partitions, and after that point performance isn’t likely to improve that much with more Kafka partitions.

minkyung.kang · September 23, 2021, 2:27am

Hi, Jhuang.

The SingleStore database had 32 partitions.

I understand that the load is fastest when the number of Kafka topic partitions equals the number of SingleStore partitions.

However, when the number of Kafka topic partitions was set to 10 or 20, the load was actually slower than with 1 partition, and it ran in two batches.

Could you explain why that happens?