Hi everyone,
I’ve built a pipeline that loads a parquet file (quite a big one, about 30k partitions, if that matters), and it occasionally fails. When I looked into it, I noticed that the failure happens because the pipeline tries to load the temp files that are created while the parquet file is being saved. I assume it picks those up and then fails because they disappear once the save finishes.
When I drop the orphan files and restart the pipeline, it works.
Is there any way to make it skip the temp files, or any other workaround?
Thanks
I’m not sure I understand the problem. What are the temp files you speak of? Does your app create them? Could you create the files elsewhere and, once they are ready, rename the final file into the place the pipeline loads from? That would work around the issue of the temp files being visible to pipelines.
The temp files are expected, though errors relating to them aren’t. We should only be deleting them when we’re done reading from them. What is the exact error you see, e.g. from information_schema.pipelines_errors?
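For reference, a query like this pulls the recent error rows (a rough sketch in Python; the connection details are placeholders, and since SingleStore speaks the MySQL wire protocol any MySQL client such as pymysql works):

```python
import pymysql

# Placeholder connection details; any MySQL-protocol client will do.
conn = pymysql.connect(host="singlestore-host", user="admin", password="...")

with conn.cursor() as cur:
    # error_message holds the failure text; batch_source_partition_id
    # identifies the file the extractor was reading when it failed.
    cur.execute(
        "SELECT error_message, batch_source_partition_id "
        "FROM information_schema.pipelines_errors"
    )
    for error_message, source_partition in cur.fetchall():
        print(source_partition, "->", error_message)
```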
If you happen to be reading from HDFS, enabling the advanced_hdfs_pipelines global variable switches to an approach that doesn’t involve temp files. That would work around the issue, though any information you can share to help diagnose what’s going wrong would be very helpful.
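For completeness, flipping that variable is a one-liner (same placeholder connection as above; note it only applies when the pipeline’s source is HDFS):

```python
import pymysql

# Placeholder connection details; advanced_hdfs_pipelines is the global
# mentioned above and only affects pipelines reading from HDFS.
conn = pymysql.connect(host="singlestore-host", user="admin", password="...")

with conn.cursor() as cur:
    cur.execute("SET GLOBAL advanced_hdfs_pipelines = ON")
```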
Thanks for the quick response.
Let me shed some more light on this to clear things up.
I’m using pyspark which writes a parquet file into GCS with 30k partitions.
While the file is being written, Spark first creates temp files, which are renamed later in the process, and a _SUCCESS file is generated at the end.
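For context, the write is roughly this (a sketch; the bucket, table name, and app name are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export").getOrCreate()
df = spark.read.table("my_source_table")  # placeholder source

# Spark stages output under <path>/_temporary/... first, then renames the
# part files into place and writes a _SUCCESS marker when the job commits.
df.repartition(30000).write.mode("overwrite").parquet("gs://my-bucket/export/")
```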
Looking at information_schema.pipelines_errors, I can see two errors. The first:

Cannot extract data for pipeline. AWS request failed:
NotFound: Not Found
status code: 404, request id: , host id:

with BATCH_SOURCE_PARTITION_ID = one of the temp files. A few milliseconds later, the second:

Leaf Error (xxxx): Leaf Error (xxxx): Cannot extract data for pipeline. AWS request failed:
NotFound: Not Found
status code: 404, request id: , host id:
It seems like the ideal solution would be to somehow make the pipeline wait for the _SUCCESS file or skip temp files, but I couldn’t find any such option in the documentation.
I believe that, as a workaround, writing the parquet output to a third location and moving it to the destination only after the write completes would work, but it’s not ideal.
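Roughly what I have in mind (a sketch using the google-cloud-storage client; the bucket and prefixes are made up, and since GCS has no real rename, the “move” is a copy):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")   # placeholder bucket
staging = "staging/export/"           # where Spark writes
final = "pipeline-src/export/"        # the prefix the pipeline watches

# Promote the output only once Spark has committed the job, i.e. the
# _SUCCESS marker exists; by then no temp files remain under staging.
if bucket.blob(staging + "_SUCCESS").exists():
    for blob in client.list_blobs(bucket, prefix=staging):
        name = blob.name[len(staging):]
        # Skip the _SUCCESS marker and any hidden/temporary files.
        if not name or name.startswith(("_", ".")):
            continue
        bucket.copy_blob(blob, bucket, final + name)
```

The downside is the extra copy (and double storage until the staging objects are deleted), which is part of why it’s not ideal.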