Our pipelines with an S3 source read CSV files and randomly fail with the message
“Cannot extract data for pipeline. InvalidRange: The requested range is not satisfiable
status code: 416, request id: …, host id: …”
We can see in the attached screenshot that these files do get loaded, but not on the first try.
We also see that the number of rows on the first try is usually wrong, and can even be higher than the number of rows in the file.
What could be the cause, and how should we interpret this error message?
CREATE OR REPLACE PROCEDURE p_xxx(batch QUERY(fields list))…
CREATE OR REPLACE PIPELINE s3_xxx AS LOAD DATA
LINK DATA_S3 'xxx/yyy/*.csv.gz' INTO PROCEDURE p_xxx
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Our pipelines assume that once a file is visible in S3, its contents will not change. It looks like the CSV files here may be getting modified or re-uploaded concurrently. That error (and the different row counts between failed and successful batches) suggests that the size and contents of a file changed while the pipeline was in the process of fetching it. The batch_partition_parsed_rows value that we report is the exact number of rows that we found as we parsed the file, up to the point where we hit the error.
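To make the failure mode concrete, here is a minimal Python sketch of how a re-upload in the middle of a fetch can produce a 416 together with a misleading row count. It is a simulation under stated assumptions, not the pipeline's actual extractor: it assumes the reader records the object size once and then requests byte ranges, and it uses an uncompressed CSV for simplicity. FakeObjectStore, RangeNotSatisfiable, fetch_in_chunks and count_data_rows are hypothetical names invented for this illustration and are not part of any S3 or SingleStore API.

```python
class RangeNotSatisfiable(Exception):
    """Stands in for an HTTP 416 response (hypothetical, for illustration)."""
    status_code = 416


class FakeObjectStore:
    """In-memory stand-in for a bucket. This is not the real S3 API."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        # An upload to an existing key replaces the whole object.
        self._objects[key] = body

    def size(self, key):
        return len(self._objects[key])

    def get_range(self, key, start, end):
        body = self._objects[key]
        # A range that starts at or beyond the end of the object
        # cannot be satisfied, which is what status 416 means.
        if start >= len(body):
            raise RangeNotSatisfiable(f"bytes={start}-{end} of {len(body)}")
        return body[start:end + 1]


def fetch_in_chunks(store, key, chunk_size, after_chunk=None):
    """Read an object in byte ranges, trusting the size seen at the start."""
    total = store.size(key)  # assumption: size is recorded once, up front
    received = b""
    offset = 0
    try:
        while offset < total:
            end = min(offset + chunk_size, total) - 1
            received += store.get_range(key, offset, end)
            offset = end + 1
            if after_chunk is not None:
                after_chunk()  # hook used to simulate a concurrent writer
    except RangeNotSatisfiable as exc:
        return received, exc
    return received, None


def count_data_rows(received):
    """Count complete '\\r\\n'-terminated lines, ignoring the header line."""
    complete_lines = received.split(b"\r\n")[:-1]
    return max(len(complete_lines) - 1, 0)


ORIGINAL = b"id,v\r\n1,a\r\n2,b\r\n3,c\r\n4,d\r\n"  # header + 4 rows, 26 bytes
REUPLOADED = b"id,v\r\n9,z\r\n"                      # header + 1 row, 11 bytes

store = FakeObjectStore()
store.put("data.csv", ORIGINAL)

# The file is replaced by a shorter version after the first 16-byte chunk.
received, error = fetch_in_chunks(
    store, "data.csv", 16,
    after_chunk=lambda: store.put("data.csv", REUPLOADED),
)
print(error.status_code, count_data_rows(received))  # prints: 416 2
```

In this run the reader had already parsed 2 rows from the old contents when the next range request failed, while the file that ended up in the bucket holds only 1 row. That is one way a failed batch can report more rows than the final file contains. If the uploader can be changed, writing each new version under a fresh key instead of overwriting an existing one would avoid this situation.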