We have a columnstore table that was created using a query like the one below:
CREATE TABLE key_metrics (
  source_id TEXT,
  date TEXT,
  metric1 FLOAT,
  metric2 FLOAT,
  …
  SHARD KEY (source_id, date) USING CLUSTERED COLUMNSTORE
);
We have an application that uses Spark (with Spark Job Server) to query the MemSQL table. Below is a simplified form of the DataFrame operations we are doing in Scala:
import org.apache.spark.sql.functions.col

sparkSession
  .read
  .format("com.memsql.spark.connector")
  .options(Map("path" -> "dbName.key_metrics"))
  .load()
  .filter(col("source_id").equalTo("12345678"))
  .filter(col("date").isin("2019-02-01", "2019-02-02", "2019-02-03"))
By inspecting the physical plan constructed for the DataFrame, I have confirmed that these filter predicates are indeed being pushed down to MemSQL.
It is my understanding that with partition pushdown, during bulk loading, we can leverage parallelism across all the cores of the machines by creating as many Spark tasks as there are MemSQL database partitions. However, when I run the Spark pipeline and look at the Spark UI, only one Spark task is created. It makes a single query to the database and therefore uses only a single core.
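For anyone who wants to reproduce the check outside the Spark UI, here is a minimal sketch of how the plan and the partition count can be inspected. It assumes `df` is the DataFrame built by the read above; the object and method names (`PartitionCheck`, `describeParallelism`) are hypothetical and only for illustration:

```scala
import org.apache.spark.sql.DataFrame

object PartitionCheck {
  // Hypothetical helper: prints the query plans for a DataFrame and
  // returns the number of partitions of its underlying RDD, which is
  // the number of tasks Spark schedules for the scan stage.
  def describeParallelism(df: DataFrame): Int = {
    // Prints the parsed, analyzed, optimized and physical plans to stdout.
    df.explain(true)

    // Partition count of the RDD backing this DataFrame.
    val numPartitions = df.rdd.getNumPartitions
    println(s"Spark partitions: $numPartitions")
    numPartitions
  }
}
```

If my understanding of partition pushdown is right, I would expect this to report one Spark partition per MemSQL partition rather than 1.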
I have checked that the rows are distributed fairly evenly across the table's partitions:
+---------------+-------------+--------------+--------+------------+
| DATABASE_NAME | TABLE_NAME  | PARTITION_ID | ROWS   | MEMORY_USE |
+---------------+-------------+--------------+--------+------------+
| dbName        | key_metrics |            0 | 784012 |          0 |
| dbName        | key_metrics |            1 | 778441 |          0 |
| dbName        | key_metrics |            2 | 671606 |          0 |
| dbName        | key_metrics |            3 | 748569 |          0 |
| dbName        | key_metrics |            4 | 622241 |          0 |
| dbName        | key_metrics |            5 | 739029 |          0 |
| dbName        | key_metrics |            6 | 955205 |          0 |
| dbName        | key_metrics |            7 | 751677 |          0 |
+---------------+-------------+--------------+--------+------------+
I have made sure that the following properties are set as well:
spark.memsql.disablePartitionPushdown = false
spark.memsql.defaultDatabase = "dbName"
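In case it helps to rule out a configuration problem, here is a rough sketch of how these values can be read back from the running session. The object and method names (`MemsqlConfCheck`, `printMemsqlConf`) are hypothetical, and I am assuming that the values visible through the session's runtime config are the ones the connector actually uses:

```scala
import org.apache.spark.sql.SparkSession

object MemsqlConfCheck {
  // Property names copied from the settings listed above.
  val keys: Seq[String] = Seq(
    "spark.memsql.disablePartitionPushdown",
    "spark.memsql.defaultDatabase"
  )

  // Prints each property as seen by the given session.
  def printMemsqlConf(spark: SparkSession): Unit =
    keys.foreach { key =>
      // getOption returns None when the key is not set on this session.
      val value = spark.conf.getOption(key).getOrElse("<not set>")
      println(s"$key = $value")
    }
}
```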
Is my understanding of partition pushdown incorrect? Is there some other configuration that I am missing?
Would appreciate your input on this.
Thanks!
Varun