Our use case is that we have a table in SingleStore that we want to export to S3 in Parquet format, so that downstream ETL pipelines can consume the data from S3 directly.
Here is the CREATE TABLE statement for the table we want to export:
CREATE TABLE `sample_invest` (
`url` text CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`type` varchar(256) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`is_real` tinyint(1) DEFAULT NULL
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='STRICT_ALL_TABLES'
This is the query that we are using to export table to S3 in parquet format:
SELECT url, type, is_real
FROM sample_invest
INTO S3 'my-s3-path'
CONFIG '{\"region\": \"us-east-1\", \"multipart_chunk_size_mb\": 5}'
CREDENTIALS '{\"role_arn\": "my-role"}'
FORMAT PARQUET;
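As a sanity check after the export, the objects can be listed with boto3 before any Spark job touches them — a minimal sketch, assuming boto3 is available and that 'my-s3-path' splits into the hypothetical bucket and prefix below:

```python
import boto3

# Hypothetical bucket and prefix standing in for 'my-s3-path'.
s3 = boto3.client("s3", region_name="us-east-1")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="my-prefix/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```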
After running the above query, the Parquet files are available in S3. Now we run a PySpark job to consume them. Here is our DataFrame, which has only 5 records:
In [38]: df=spark.read.parquet('my-s3-path')
In [39]: df.show()
+--------------------+----------+-------+
| url| type|is_real|
+--------------------+----------+-------+
|[68 74 74 70 73 3...|investment| null|
|[68 74 74 70 73 3...|investment| 1|
|[68 74 74 70 73 3...|investment| null|
|[68 74 74 70 73 3...|investment| 0|
|[68 74 74 70 73 3...|investment| null|
+--------------------+----------+-------+
In [42]: df.printSchema()
root
|-- url: binary (nullable = true)
|-- type: string (nullable = true)
|-- is_real: byte (nullable = true)
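To check whether these types come from the files themselves rather than from Spark's reader, the Parquet footer can be inspected directly — a minimal sketch, assuming pyarrow is installed and one exported file has been copied locally (the filename below is hypothetical):

```python
import pyarrow.parquet as pq

# Read only the schema from the Parquet footer; no data is loaded.
schema = pq.read_schema("part-00000.parquet")
print(schema)  # shows whether 'url' carries a UTF-8 string annotation or is plain binary
```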
Here are the problems we are facing:
- Data type for `url` should be `string` but it is `binary`.
- Data type for `is_real` should be `boolean` but it is `byte`.
- We can't perform string comparisons; see the following code snippet.
# String comparison returns no rows in either direction.
In [41]: df.filter('type="investment"').count()
Out[41]: 0
In [43]: df.filter('type!="investment"').count()
Out[43]: 0
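For reference, a possible workaround sketch (not verified against this data): Spark's `spark.sql.parquet.binaryAsString` setting makes the reader treat binary columns without a UTF-8 annotation as strings, and explicit casts can coerce the remaining columns.

```python
from pyspark.sql import SparkSession, functions as F

# Treat un-annotated Parquet binary columns as strings on read.
spark = (SparkSession.builder
         .config("spark.sql.parquet.binaryAsString", "true")
         .getOrCreate())

df = spark.read.parquet("my-s3-path")

# Fallback: cast explicitly, decoding url bytes and mapping 0/1 to boolean.
df = (df.withColumn("url", F.col("url").cast("string"))
        .withColumn("is_real", F.col("is_real").cast("boolean")))
```

Whether this also fixes the failing equality filter on `type` is an open question; a count of 0 in both directions suggests the stored values may not match the literal exactly (e.g., trailing padding).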