Hi Team,
While creating the pipeline we set batch_interval = 2500 by default, which is the maximum time the pipeline should wait for a new file to arrive, or between batches. However, we noticed that once a file is completed, if a second file arrives before the batch interval elapses, the pipeline completes and does not pick up the just-arrived file. In some cases we are seeing it process a partial file.
How do we enforce the pipeline to wait for a minimum interval before checking for new file arrivals? It's similar to the sleep command in Unix/Linux.
We keep our pipelines in stop mode and trigger them using our internal scheduler process. Though the batch is complete and the status is updated, we would want the pipeline to wait for the specified batch interval before control goes back to the scheduler. Any help in this regard is much appreciated.
The behavior of scheduling w.r.t. batch interval is currently:
- If the pipeline has known data available to load, we ignore the batch interval and load it immediately. Once we finish loading all known data, we "immediately" poll for new data one more time, again ignoring the batch interval.
- If there is no data to load after the steps above, then we wait for the batch interval before polling for new data again.
The goal of the batch interval is to give you a way to control the polling overhead (which affects total throughput) vs latency tradeoff of your pipelines, rather than to rate limit them in general.
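For reference, here's a minimal sketch of where that knob lives in the DDL (the pipeline/table names and FS path are hypothetical; BATCH_INTERVAL is in milliseconds):

```sql
-- Hypothetical pipeline: when idle, poll the source every 2500 ms.
-- BATCH_INTERVAL only throttles polling when there is no known data;
-- it does not delay batches while data is available.
CREATE PIPELINE my_pipeline AS
  LOAD DATA FS '/nfs/incoming/*.csv'
  BATCH_INTERVAL 2500
  INTO TABLE my_table
  FIELDS TERMINATED BY ',';

-- The interval can also be changed on an existing pipeline:
ALTER PIPELINE my_pipeline SET BATCH_INTERVAL 2500;
```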
Our usual recommendation for avoiding "partial files" is to exploit pipelines' filter-by-name features (docs for the s3 variant, for example) to arrange for only "final files" to be visible to the pipeline. For example, CREATE PIPELINE AS LOAD DATA S3 "bucket/prefix" CONFIG '{"suffixes":[".foo"]}' will only load objects whose key starts with prefix and ends with .foo. Is it possible to arrange for your data producer to work with that?
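As a sketch (the pipeline name, bucket, table, and credentials below are placeholders), the producer would upload to, say, data.tmp and rename to data.foo only once the file is complete, so in-flight files are never visible:

```sql
-- Hypothetical example: only objects under "prefix" whose keys end in
-- ".foo" are visible to the pipeline, so half-written files are ignored.
CREATE PIPELINE final_files_only AS
  LOAD DATA S3 'bucket/prefix'
  CONFIG '{"suffixes": [".foo"]}'
  CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
  INTO TABLE my_table
  FIELDS TERMINATED BY ',';
```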
Meanwhile, to wait for everything that you want to load to finish loading, I’d recommend either polling information_schema.pipelines_files if you know the file list, or else running start pipeline foreground at some point after you know that all file uploads have finished. I believe there are some extra caveats with start pipeline foreground + s3 in particular, so I’d probably recommend the pipelines_files approach if you’re working with s3.
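In case it's useful, a sketch of both approaches (pipeline name is hypothetical, and I'm assuming the usual FILE_STATE values in pipelines_files):

```sql
-- Hypothetical check: count files not yet loaded by the pipeline.
-- The scheduler can re-run this until it returns 0.
SELECT COUNT(*) AS remaining
FROM information_schema.pipelines_files
WHERE pipeline_name = 'my_pipeline'
  AND file_state != 'Loaded';

-- Alternatively, once all file uploads are known to have finished,
-- block until everything currently visible has been loaded:
START PIPELINE my_pipeline FOREGROUND;
```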
Hi Sasha,
Thank you so much for the detailed information. We are starting the pipeline in the foreground through one of the scheduler jobs, and it's a filesystem configuration; our files are placed on a NAS/NFS share. The issue we are facing is that we move the files from the origination path to the NAS path, and only then trigger the pipeline in the foreground. Though we can see that all the files have been moved to the NFS path, we notice that sporadically the pipeline picks up only a partial file, and that is causing the issue. So I want to control it through batch_interval.
E.g., we transfer more than 10 files in one go to the NFS path, and validating the files with the originator (number of files, size, record count, etc.) takes some time, yet the pipeline still processes partial files at times.
I would really appreciate it if you could provide an alternate option so that the pipeline picks up the full file without any issue.
The batch interval is how often that pipeline will wake up to check for new data once it has loaded everything it knows about.
The pipeline is designed to run continuously in the background. If you are starting and stopping your pipelines and triggering them from your scheduler, you should run them in the foreground.
A foreground pipeline will stop when any of the following occurs:
- The batch limit is reached (if it is specified).
- An error occurs.
- All data from the data source has been loaded.
A foreground pipeline is intended for the following purposes:
- Testing and debugging.
- Running a one-time bulk data load.
- Writing your own scheduler to load data at specific times.
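For what it's worth, a minimal sketch of how a scheduler-driven foreground run might look (the pipeline name is hypothetical, and the LIMIT ... BATCHES clause is optional):

```sql
-- Scheduler-driven run: block until all currently known data has been
-- loaded (or an error occurs), then return control to the scheduler.
START PIPELINE my_pipeline FOREGROUND;

-- Or cap the run so it stops after a fixed number of batches:
START PIPELINE my_pipeline FOREGROUND LIMIT 10 BATCHES;
```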