Hello,
I am working on setting up a MemSQL pipeline to read data in .parquet format from an S3 bucket. I successfully got a pipeline working with the MySQL command-line client when reading from a single file in .csv format, using the following syntax:
mysql> CREATE OR REPLACE PIPELINE `csv_test_pipeline`
AS LOAD DATA S3 "s3://my_bucket/csv_test_data.csv"
SKIP DUPLICATE KEY ERRORS
INTO TABLE `csv_test_table`
FIELDS TERMINATED BY ','
(`CCID_1`,
`CCID_2`,
`CCID_3`);
Query OK, 0 rows affected (0.16 sec)
mysql> START PIPELINE csv_test_pipeline FOREGROUND;
Query OK, 1001 rows affected (0.51 sec)
However, I can’t get a .parquet pipeline to work. I wrote the same dataset as above out to the same bucket in .parquet format instead of .csv, using Spark/Databricks. I am able to create the pipeline, but I get an error when running it.
EDIT: I realized after originally posting that the parquet file had snappy compression enabled by default. I have since re-written the file with compression=none.
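For reference, this is roughly how the file was re-written from Databricks. It is only a minimal sketch: the CSV source, the coalesce(1) call, and the exact paths are placeholders/assumptions, not my actual job.

from pyspark.sql import SparkSession

# On Databricks the session already exists as `spark`; getOrCreate() just
# picks it up when run elsewhere.
spark = SparkSession.builder.getOrCreate()

# Stand-in for the actual dataset; reading the CSV used by the working
# pipeline is only for illustration.
df = spark.read.csv("s3://my_bucket/csv_test_data.csv", header=True)

# coalesce(1) keeps the output to a single part file; compression="none"
# replaces Spark's default snappy codec.
(df.coalesce(1)
   .write
   .mode("overwrite")
   .parquet("s3://my_bucket/test_data.parquet", compression="none"))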
mysql> CREATE OR REPLACE PIPELINE `parquet_test_pipeline`
AS LOAD DATA S3 "s3://my_bucket/test_data.parquet"
SKIP DUPLICATE KEY ERRORS
INTO TABLE `parquet_test_table`
(`CCID_1` <- `CCID_1`,
`CCID_2` <- `CCID_2`,
`CCID_3` <- `CCID_3`)
FORMAT PARQUET;
-- Query OK, 0 rows affected (0.17 sec)
mysql> START PIPELINE parquet_test_pipeline FOREGROUND;
-- ERROR 1934 (HY000): Leaf Error (<internal_ip>:3306): Leaf Error (<internal_ip>:3306): Cannot extract data for pipeline. AWS request failed:
NotFound: Not Found
status code: 404, request id: 958..., host id: C3BZ...
The parquet file was written out using Spark running on Databricks. The S3 folder contains the usual Spark marker files along with (in this case) one data file:
s3://my_bucket/test_data.parquet/_SUCCESS
s3://my_bucket/test_data.parquet/_committed_5112107910452879650
s3://my_bucket/test_data.parquet/_started_5112107910452879650
s3://my_bucket/test_data.parquet/part-00000-tid-51...c000.parquet
I’ve tried different variations of the S3 path (e.g. /test_data.parquet, /test_data.parquet/, /test_data.parquet/*, etc.), but I always get the same 404 error. As a test, I SSHed into the leaf node in question and copied the S3 object using the AWS CLI with no issues. I’m not sure how this could be an AWS access issue, given that the .csv pipeline has no problems.
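For completeness, this is the kind of check I ran from the leaf node, sketched here with boto3 instead of the actual aws s3 cp command; the bucket name and the truncated object key mirror the listing above and are not literal values.

import boto3

s3 = boto3.client("s3")

# head_object raises a ClientError with a 404 if the exact key does not
# exist, which matches the error the pipeline reports.
s3.head_object(Bucket="my_bucket",
               Key="test_data.parquet/part-00000-tid-51...c000.parquet")

# Downloading the object succeeds from the leaf node, so plain S3 access
# does not appear to be the problem.
s3.download_file("my_bucket",
                 "test_data.parquet/part-00000-tid-51...c000.parquet",
                 "/tmp/part-00000.parquet")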
Any help is appreciated.