I have a scenario where I am exporting tables to S3 using a SELECT ... INTO S3 query.
This is the sample command (Python) that I am using for the export:
export_to_s3_sql = f"SELECT * FROM {database}.{table} " \
f"INTO S3 '{s3_full_path}/{database}/{table}/{table}' " \
f"CONFIG '{json.dumps(export_config)}' " \
f"CREDENTIALS '{json.dumps(export_creds)}' " \
f"FIELDS TERMINATED BY '\\t' ENCLOSED BY '\"' ESCAPED BY '\\\\' " \
f"LINES TERMINATED BY '\\r'"
Once the export is complete, I read the tables using Spark. My tables are large (~2 TB), and sometimes they contain newline characters in column values.
My aim is to identify, when reading via Spark, whether my CSV in S3 contains newline characters inside column values.
One way is to verify the row counts, but a table might get updated before, during, or after the query execution.
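Roughly, the row-count check I have in mind looks like the sketch below (untested; it reuses the variables from the export snippet, and the lineSep option needs Spark 3.x):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-count-check").getOrCreate()

# Read the export back with the same delimiters the export used.
df = (spark.read
      .option("sep", "\t")
      .option("quote", '"')
      .option("escape", "\\")
      .option("lineSep", "\r")
      .csv(f"{s3_full_path}/{database}/{table}/"))

# If this count differs from the row count captured at export time, some rows
# were probably split by embedded newlines.
print("rows read by Spark:", df.count())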
My first question is: what will happen if the table gets updated during the execution of the above query?
During the export, is there any simple way to know whether any of the columns contains a newline character?
Can I somehow save the row count in a separate file, given that the table may receive an update at any time?
During the export, is there any way to replace all the newline characters in column values with empty strings?
Is there any way I can add a new column with some default value? I will use this flag to detect whether my CSV has newline characters as column values.
My first question is: what will happen if the table gets updated during the execution of the above query?
→ The query will output any committed rows it sees during execution. It could see part of a committed transaction, but it will never see uncommitted updates; the query simply outputs whatever rows it retrieves.
During the export, is there any simple way to know whether any of the columns contains a newline character?
→ You could use LIKE, RLIKE, or a similar function to check for the presence of newlines, instead of using SELECT *. E.g. try running SELECT 'abc\nxyz' LIKE '%\n%';
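A rough sketch of doing that across all columns, assuming a MySQL-compatible DB-API cursor (such as the pymysql one above) and reusing the database/table variables from your snippet:

# Fetch the column names so the check does not rely on SELECT *.
cursor.execute(
    "SELECT COLUMN_NAME FROM information_schema.COLUMNS "
    f"WHERE TABLE_SCHEMA = '{database}' AND TABLE_NAME = '{table}'"
)
columns = [row[0] for row in cursor.fetchall()]

# OR together one LIKE check per column; the database reads '\n' in the string
# literal as a newline. Restricting this to string columns would be cheaper.
predicate = " OR ".join(f"`{col}` LIKE '%\\n%'" for col in columns)
cursor.execute(f"SELECT COUNT(*) FROM {database}.{table} WHERE {predicate}")
rows_with_newlines = cursor.fetchone()[0]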
Can I somehow save the row count in a separate file, given that the table may receive an update at any time?
→ You could try SELECT FOUND_ROWS(); in the next query. It works with regular queries, but I don't know whether it does after SELECT INTO S3.
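Something along these lines could persist the count right after the export (untested, and FOUND_ROWS() after SELECT ... INTO S3 is exactly the part I am unsure about):

# Run the export, then immediately ask for the row count and persist it.
cursor.execute(export_to_s3_sql)
cursor.execute("SELECT FOUND_ROWS()")
row_count = cursor.fetchone()[0]

# The table can still change between statements, so treat the count as approximate.
with open(f"{database}.{table}.rowcount", "w") as f:
    f.write(str(row_count))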
During the export, is there any way to replace all the newline characters in column values with empty strings?
→ You could use a replace function of some kind to do it, but then you would have to list out all the columns with an appropriate expression rather than using SELECT *.
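As a sketch, the SELECT list could be generated from the column names (reusing the columns list from the snippet above); note that wrapping numeric columns in REPLACE turns them into strings, which is usually acceptable for a CSV export:

# Strip \n and \r from every column instead of selecting *.
select_list = ", ".join(
    f"REPLACE(REPLACE(`{col}`, '\\n', ''), '\\r', '') AS `{col}`"
    for col in columns
)
export_to_s3_sql = (
    f"SELECT {select_list} FROM {database}.{table} "
    f"INTO S3 '{s3_full_path}/{database}/{table}/{table}' "
    f"CONFIG '{json.dumps(export_config)}' "
    f"CREDENTIALS '{json.dumps(export_creds)}' "
    f"FIELDS TERMINATED BY '\\t' ENCLOSED BY '\"' ESCAPED BY '\\\\' "
    f"LINES TERMINATED BY '\\r'"
)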
Is there any way I can add a new column with some default value? I will use this flag to detect whether my CSV has newline characters as column values.
→ Yes, but again, you would have to use an expression in the SELECT list, not SELECT *.
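For instance (the column name and value are made up), a constant marker column makes split rows easy to spot once the CSV is read back:

# Append a constant marker column; after reading the CSV in Spark, any row whose
# last column is not 'ROW_END' was probably split by an embedded newline.
flagged_select = f"SELECT *, 'ROW_END' AS row_marker FROM {database}.{table} "
# ...followed by the same INTO S3 / CONFIG / CREDENTIALS / FIELDS clauses as above.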
I am also curious whether you need to export the data to CSV for other reasons, or whether you are just reading it with Spark. If the latter, are you aware that we have a Spark Connector that you can use to connect directly to SingleStore? It offers performance optimizations in the form of SQL Pushdown and Parallel Read capabilities that could greatly speed up the read process.
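A minimal sketch of a direct read through the connector might look like this; the endpoint, credentials, and option keys are placeholders written from memory of the connector docs, so please check them against the current documentation (the connector jar also needs to be on the Spark classpath):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("singlestore-direct-read")
         .config("spark.datasource.singlestore.ddlEndpoint", "svc-example.singlestore.com:3306")
         .config("spark.datasource.singlestore.user", "admin")
         .config("spark.datasource.singlestore.password", "********")
         .getOrCreate())

# Read the table directly; the connector can push SQL down and read partitions
# in parallel instead of going through CSV files in S3.
df = spark.read.format("singlestore").load(f"{database}.{table}")
df.show(5)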