Our use case is that we have a table in SingleStore that we want to export to S3 in Parquet format, so that downstream ETL pipelines can consume the data from S3 directly.
Here is the CREATE TABLE statement for the table we want to export:
CREATE TABLE `sample_invest` (
`url` text CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`type` varchar(256) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`is_real` tinyint(1) DEFAULT NULL
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='STRICT_ALL_TABLES'
This is the query that we are using to export table to S3 in parquet format:
SELECT url, type, is_real
FROM sample_invest
INTO S3 'my-s3-path'
CONFIG '{\"region\": \"us-east-1\", \"multipart_chunk_size_mb\": 5}'
CREDENTIALS '{\"role_arn\": "my-role"}'
FORMAT PARQUET;
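As a sanity check after the export, the objects can be listed with boto3 before any Spark job touches them — a minimal sketch, assuming boto3 is available and that 'my-s3-path' splits into the hypothetical bucket and prefix below:

```python
import boto3

# Hypothetical bucket and prefix standing in for 'my-s3-path'.
s3 = boto3.client("s3", region_name="us-east-1")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="my-prefix/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```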
After running the above query, the Parquet files are available in S3. Now we run a PySpark job to consume them. Here is our DataFrame, which has only 5 records:
In [38]: df=spark.read.parquet('my-s3-path')
In [39]: df.show()
+--------------------+----------+-------+
| url| type|is_real|
+--------------------+----------+-------+
|[68 74 74 70 73 3...|investment| null|
|[68 74 74 70 73 3...|investment| 1|
|[68 74 74 70 73 3...|investment| null|
|[68 74 74 70 73 3...|investment| 0|
|[68 74 74 70 73 3...|investment| null|
+--------------------+----------+-------+
In [42]: df.printSchema()
root
|-- url: binary (nullable = true)
|-- type: string (nullable = true)
|-- is_real: byte (nullable = true)
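To check whether these types come from the files themselves rather than from Spark's reader, the Parquet footer can be inspected directly — a minimal sketch, assuming pyarrow is installed and one exported file has been copied locally (the filename below is hypothetical):

```python
import pyarrow.parquet as pq

# Read only the schema from the Parquet footer; no data is loaded.
schema = pq.read_schema("part-00000.parquet")
print(schema)  # shows whether 'url' carries a UTF-8 string annotation or is plain binary
```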
Here are the problems we are facing:
- Data type for `url` should be `string` but it is `binary`.
- Data type for `is_real` should be `boolean` but it is `byte`.
- We can't perform string comparisons; see the following code snippet.
# String comparison returns no rows in either direction.
In [41]: df.filter('type="investment"').count()
Out[41]: 0
In [43]: df.filter('type!="investment"').count()
Out[43]: 0
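For reference, a possible workaround sketch (not verified against this data): Spark's `spark.sql.parquet.binaryAsString` setting makes the reader treat binary columns without a UTF-8 annotation as strings, and explicit casts can coerce the remaining columns.

```python
from pyspark.sql import SparkSession, functions as F

# Treat un-annotated Parquet binary columns as strings on read.
spark = (SparkSession.builder
         .config("spark.sql.parquet.binaryAsString", "true")
         .getOrCreate())

df = spark.read.parquet("my-s3-path")

# Fallback: cast explicitly, decoding url bytes and mapping 0/1 to boolean.
df = (df.withColumn("url", F.col("url").cast("string"))
        .withColumn("is_real", F.col("is_real").cast("boolean")))
```

Whether this also fixes the failing equality filter on `type` is an open question; a count of 0 in both directions suggests the stored values may not match the literal exactly (e.g., trailing padding).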