What kind of scale requirements do you have, like how many rows per second do you want to move into SingleStore from Pulsar?
We don’t have a plan to develop a Pulsar or Flink sink connector.
You can add multiple aggregators if you need to handle high scale, and split work across those aggregators (with app logic or a load balancer). Going through the aggregators is the standard way to program against SingleStore. Going directly to the leaves is not something that regular client application software should do.
I would certainly start with the built-in JDBC connector and see how it does. To ensure maximum performance, point the sink at a load balancer over all of your child aggregators (this allows you to horizontally scale ingest) and experiment with different batch sizes. I bet you will be pleasantly surprised by the out-of-the-box performance.
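To make the batching concrete, here is a minimal sketch of what such a sink does under the hood: batched prepared-statement inserts over a connection pointed at the load balancer. Everything specific here (the agg-lb.example.com host, the events table, the MySQL-compatible URL, the batch size of 1000) is a placeholder, not something from your setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.util.List;

public class BatchInsertSketch {
    // "agg-lb.example.com" stands in for a load balancer over the child aggregators;
    // the table and columns are placeholders. The SingleStore JDBC driver works the
    // same way with a jdbc:singlestore:// URL.
    static final String URL = "jdbc:mysql://agg-lb.example.com:3306/mydb";
    static final int BATCH_SIZE = 1000; // the main knob to experiment with

    public record Event(long id, Timestamp ts, String payload) {}

    public static void insert(List<Event> events) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO events (id, ts, payload) VALUES (?, ?, ?)")) {
            int pending = 0;
            for (Event e : events) {
                ps.setLong(1, e.id());
                ps.setTimestamp(2, e.ts());
                ps.setString(3, e.payload());
                ps.addBatch();
                if (++pending == BATCH_SIZE) { // flush a full batch to the aggregator
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) ps.executeBatch(); // flush the remainder
        }
    }
}
```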
If you want to go further, check out our official Kafka (Confluent Connect) connector for an example of how to squeeze out every drop of ingest performance. One of the key optimizations is to switch from batch inserts to using LOAD DATA along with compressing the input stream.
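To give a rough idea of that technique, here is a hedged sketch of the general approach, not the connector's actual code: the table, host, and credentials are made up, I'm assuming MySQL Connector/J, whose JdbcStatement lets you feed LOAD DATA LOCAL INFILE from an in-memory stream, and the .gz pseudo-filename is what (as I understand it) asks the server to gunzip the stream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class LoadDataSketch {
    public static void loadBatch(List<String[]> rows) throws Exception {
        // 1. Serialize the batch as tab-separated text and gzip it in memory.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (String[] row : rows) {
                gz.write((String.join("\t", row) + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }

        // allowLoadLocalInfile is required for LOAD DATA LOCAL with Connector/J.
        String url = "jdbc:mysql://agg-lb.example.com:3306/mydb?allowLoadLocalInfile=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             Statement stmt = conn.createStatement()) {
            // 2. Hand the driver an in-memory stream instead of a real file
            //    (setLocalInfileInputStream is a Connector/J-specific hook).
            stmt.unwrap(com.mysql.cj.jdbc.JdbcStatement.class)
                .setLocalInfileInputStream(new ByteArrayInputStream(buf.toByteArray()));

            // 3. The ".gz" pseudo-filename asks the server to decompress the stream.
            stmt.execute("LOAD DATA LOCAL INFILE 'batch.tsv.gz' "
                       + "INTO TABLE events FIELDS TERMINATED BY '\\t'");
        }
    }
}
```

The point is that the whole batch goes over the wire as one compressed LOAD DATA stream instead of many individual inserts.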
@carl - In your last paragraph, did you mean that we should implement our own connector (one that reads from Pulsar) instead of using the JDBC one, “mimicking” the existing Kafka connector code and using LOAD DATA, and in addition insert directly into the leaves?
Also, relating to “how many rows/s we intend to insert” - for now it will be a few tens of thousands of inserts per second, a total of about 20-30k spread across 6 big fact tables, each already containing 10M-50M rows. Half of the inserts will be “INSERT … ON DUPLICATE KEY UPDATE”.
I guess a relatively small SingleStore cluster should meet this load?
We are comparing the performance / $ ratio with RDS and Aurora… so should I assume that, since SingleStore’s insert performance should be much better, we will need less h/w?
It’s an option, although I would start with testing the JDBC sink rather than building your own.
Also - don’t insert into the leaf nodes directly. Insert into the child aggregators, and use a load balancer to spread the inserts over all of your aggs.
Yes - that load is very small for SingleStore assuming you set up the schemas right. I don’t think you need to build a custom connector - the normal JDBC connector should work just fine.
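Since half of your writes are “INSERT … ON DUPLICATE KEY UPDATE”, note that those go through exactly the same batched JDBC path - only the statement text changes. A rough sketch, with a made-up fact_orders table standing in for one of your fact tables:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class UpsertBatchSketch {
    // Placeholder row type; real fact tables will have their own columns and keys.
    public record OrderRow(long orderId, String status, BigDecimal amount) {}

    public static void upsert(Connection conn, List<OrderRow> rows) throws Exception {
        String sql = "INSERT INTO fact_orders (order_id, status, amount) VALUES (?, ?, ?) "
                   + "ON DUPLICATE KEY UPDATE status = VALUES(status), amount = VALUES(amount)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (OrderRow r : rows) {
                ps.setLong(1, r.orderId());
                ps.setString(2, r.status());
                ps.setBigDecimal(3, r.amount());
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip per batch through the aggregator
        }
    }
}
```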
That is a reasonable assumption, and I am very excited to hear the results. I highly recommend jumping on a call with one of our engineers, Sarung Tripathi (stripathi@singlestore.com), to discuss your plans if you want to absolutely maximize your cost efficiency.
What is the performance difference (insert rate) between the JDBC connector or the Kafka sink connector (both go through the aggregators) vs. a SingleStore Kafka pipeline, which can insert directly into the leaves?
Also, how many aggregators do we need to add in order to compensate for this?
Just to confirm I understood your comment above on “…switch from batch inserts to using LOAD DATA along with compressing the input stream”: you suggested that if we want additional optimization, we can adjust the JDBC connector source code to be SingleStore-specific and use LOAD DATA? Also, isn’t LOAD DATA mainly for reading data from files, or can it also read from a stream?
Does just using bulk inserts (from a JDBC connector) also run them in parallel?
This has not been explicitly measured since it heavily depends on the cluster configuration. As a start, please test the built-in JDBC sink to see if it meets your performance metrics. There are many tweaks we can do to the setup to reduce the load on the aggregators, which would allow you to run this solution without excessive child aggregators. Metrics you should monitor include: CPU/network load on the aggs, network load between the aggs and leaves, and CPU load on the leaves. You can also play with different SHARD keys to improve performance.
I am not entirely sure what you mean. A batch insert is a query of the form insert into... values (row 1), (row 2), .... You can open multiple connections to SingleStore and run a separate batch insert on each connection if you want to run things in parallel. The normal JDBC connectors have no concept of parallelism by default - it has to be added in the wrapping application, for example in the Pulsar sink connector (which probably runs batches in parallel automatically).
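A rough sketch of that pattern (host, table, and data are made up): each worker holds its own connection and runs its own multi-row insert, which is all the parallelism amounts to:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelBatchInsertSketch {
    // "agg-lb.example.com" is a hypothetical load balancer over the child aggregators.
    static final String URL = "jdbc:mysql://agg-lb.example.com:3306/mydb";

    public static void main(String[] args) {
        int parallelism = 4; // one connection (and one in-flight batch insert) per worker
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);

        List<String> batches = List.of(
            "INSERT INTO events (id, payload) VALUES (1,'a'),(2,'b'),(3,'c')",
            "INSERT INTO events (id, payload) VALUES (4,'d'),(5,'e'),(6,'f')",
            "INSERT INTO events (id, payload) VALUES (7,'g'),(8,'h'),(9,'i')",
            "INSERT INTO events (id, payload) VALUES (10,'j'),(11,'k'),(12,'l')");

        for (String batch : batches) {
            pool.submit(() -> {
                // Each task opens its own connection, so the batches run concurrently.
                try (Connection conn = DriverManager.getConnection(URL, "user", "secret");
                     Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate(batch);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}
```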