Hi,
1)In kafka pipeline suppose my kafka topic has 2 partitions and the memsql database it is writing to has 4 partitions, question is how memsql will distribute data using pipeline from this kafka topic? is each partition of memsql database will get equal load from each partition or some other strategies
2)How about if reverse case is there where database has less partitions compare to kafka topic
3)is there any way to query a database partition to figure out how many and which all rows it contains ?
Will try to answer in as much detail as possible. Please let me know if you have any clarifying questions about the below - and note the use of memsql partitions vs kafka partitions which are very similar but distinctly different terms.
(1) MemSQL will always pair one memsql partition with one kafka partition. In this example, that means one batch of this pipeline will be consumed by 2/4 memsql partitions. By default each memsql pipeline consumes MAX_PARTITIONS_PER_BATCH batch-partitions within one batch (default = numMemsqlPartitions or numKafkaPartitions, whichever is higher). Note that consumed refers to consuming the kafka offsets; all rows in a batch-partition still are committed to their destination memsql partition (which could be different than the memsql partition which is running the pipeline consumer) based on the shard key in the reshuffle step. You may query information_schema.pipelines_batches for information on how many batch-partitions make up one batch and which memsql partitions are doing the consumer work for each batch.
NOTE: with multiple pipelines running concurrently, the pipelines scheduler can only run numMemsqlPartitions total concurrent batches amongst all pipelines, it is helpful to throttle the MAX_PARTITIONS_PER_BATCH parameter if you know some pipelines have less data.
(2) In this case memsql chooses two kafka partitions to consume from first as part of batch 1, and then consumes the other two as part of batch 2.
(3) You can find out which kafka partition and which memsql partition are producing/consuming the data from the BATCH_SOURCE_PARTITION_ID and HOST/PORT/PARTITION fields in information_schema.pipelines_batches. NOTE again that memsql partition 0 being a consumer of 100 rows from kafka does not mean that all 100 rows end up in memsql partition 0; the target partition in memsql is determined by the hash function based on shard key. The way to check for data skew in row counts per memsql partition is by using the query in @hanson’s reply.