Hello,
I am working on setting up a MemSQL pipeline to read data in .parquet format from an S3 bucket. Using the MySQL command-line client, I successfully got a pipeline working that reads a single file in .csv format, with the following syntax:
mysql> CREATE OR REPLACE PIPELINE `csv_test_pipeline`
AS LOAD DATA S3 "s3://my_bucket/csv_test_data.csv"
SKIP DUPLICATE KEY ERRORS
INTO TABLE `csv_test_table`
FIELDS TERMINATED BY ','
(`CCID_1`,
`CCID_2`,
`CCID_3`);
Query OK, 0 rows affected (0.16 sec)
mysql> START PIPELINE csv_test_pipeline FOREGROUND;
Query OK, 1001 rows affected (0.51 sec)
However, I can’t get a .parquet pipeline to work. I wrote out the same dataset as above to the same bucket in .parquet format instead of .csv format using Spark/Databricks. I am able to create the pipeline, but get an error when running it.
EDIT: I realized after originally posting that Snappy compression was enabled on the Parquet file by default. I have since re-written the file with compression=none.
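For reference, the re-write was done with roughly the following PySpark code on Databricks (spark is the SparkSession Databricks provides; the df assignment below is only illustrative of the three-column test data):

# df stands in for the same three-column test dataset (illustrative)
df = spark.createDataFrame([(1, 2, 3)], ["CCID_1", "CCID_2", "CCID_3"])
# re-write without compression; "none" disables the default Snappy codec
df.write.mode("overwrite").option("compression", "none").parquet("s3://my_bucket/test_data.parquet")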
mysql> CREATE OR REPLACE PIPELINE `parquet_test_pipeline`
AS LOAD DATA S3 "s3://my_bucket/test_data.parquet"
SKIP DUPLICATE KEY ERRORS
INTO TABLE `parquet_test_table`
(`CCID_1` <- `CCID_1`,
`CCID_2` <- `CCID_2`,
`CCID_3` <- `CCID_3`)
FORMAT PARQUET;
Query OK, 0 rows affected (0.17 sec)
mysql> START PIPELINE parquet_test_pipeline FOREGROUND;
ERROR 1934 (HY000): Leaf Error (<internal_ip>:3306): Leaf Error (<internal_ip>:3306): Cannot extract data for pipeline. AWS request failed:
NotFound: Not Found
status code: 404, request id: 958..., host id: C3BZ...
As mentioned, the Parquet data was written out by Spark running on Databricks. The S3 folder contains the usual Spark marker/metadata files along with (in this case) a single data file:
s3://my_bucket/test_data.parquet/_SUCCESS
s3://my_bucket/test_data.parquet/_committed_5112107910452879650
s3://my_bucket/test_data.parquet/_started_5112107910452879650
s3://my_bucket/test_data.parquet/part-00000-tid-51...c000.parquet
I’ve tried different variations of the S3 path (e.g. /test_data.parquet, /test_data.parquet/, /test_data.parquet/*, etc.), but I always get the same 404 error. As a test, I sshed into the leaf node from the error message and copied the Parquet object using the AWS CLI (roughly as shown below) with no problems, so I’m not sure how this could be an AWS access issue, especially since the .csv pipeline works fine.
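For completeness, this is approximately what I ran on the leaf node (the object key is abbreviated here to match the listing above); both commands completed without errors:

aws s3 ls s3://my_bucket/test_data.parquet/
aws s3 cp s3://my_bucket/test_data.parquet/part-00000-tid-51...c000.parquet /tmp/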
Any help is appreciated.