Hi Team,
Whenever we need to add a new column or modify a column in the pipeline, we have to:
Open the pipeline DDL and make the required changes
Drop the pipeline, or replace it with a CREATE OR REPLACE command
We are currently hard-coding the pipeline data source file, with either a pattern or a specific name. Instead of that, is there any alternate way to pass the data source file name as a parameter when invoking the pipeline?
The issue is that whenever we drop and recreate a pipeline, it picks up all the files matching the file data source pattern we specify, which corrupts the data badly and makes it difficult to trace.
We exercise utmost caution by archiving previously processed files, but in some instances residual files remain in the data source path, and when we recreate the pipeline and then trigger the job, it picks up all the files in the path.
I would need help on the following:
Is there a mechanism to pass the data source file name dynamically, as a parameter?
Or is there a way to just alter the pipeline, without dropping and recreating it, when we need to modify some elements of the pipeline DDL?
Is there any metadata we can query (or some other mechanism) to see pipeline and file name information without opening the pipeline DDL?
I know that once we run the pipeline, we can get the processed file names by querying the pipelines_files metadata table.
We want to ensure there is a one-to-one mapping between the pipeline and the data source file name.
I would really appreciate a solution for the above points.
Hi Prathap,
Thanks for the detailed explanation. I believe that our ALTER PIPELINE functionality can help solve the issue you're facing, as there is no need to drop and recreate pipelines. You could, for example, SET OFFSETS to ignore any files that have already been loaded. There are multiple information schema tables containing pipeline metadata; they can be found in our pipelines docs. Please take a look at the docs and let me know if you have additional questions.
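As a rough sketch (my_pipeline is a placeholder name, and the exact column and state values may vary by version, so please verify against the docs):

-- Ignore everything currently sitting in the source path; only newly arriving files will be loaded.
-- The pipeline may need to be stopped first (STOP PIPELINE my_pipeline;).
ALTER PIPELINE my_pipeline SET OFFSETS LATEST;

-- See which files a pipeline has seen, and their load state, without opening the pipeline DDL:
SELECT pipeline_name, file_name, file_state
FROM information_schema.PIPELINES_FILES
WHERE pipeline_name = 'my_pipeline';

-- The source path/pattern each pipeline points at is part of the pipeline configuration:
SELECT pipeline_name, config_json
FROM information_schema.PIPELINES;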
This is not our query. The query is: whenever we create the pipeline after dropping it, it picks up all the files available in the file path specified in the pipeline DDL. How can we control that?
In your CREATE PIPELINE, you could add logic such as a WHERE clause. Additionally, you could pipeline into a stored procedure that checks the target table for existing records before inserting, as in the sketch below.
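A minimal sketch of the stored procedure approach (the table, column, procedure, and bucket names are made up for illustration, and the credentials and format clauses are placeholders):

DELIMITER //
CREATE OR REPLACE PROCEDURE dedupe_insert(batch QUERY(id INT, payload TEXT)) AS
BEGIN
  -- Only insert rows whose id is not already present in the target table.
  INSERT INTO target_table(id, payload)
  SELECT b.id, b.payload
  FROM batch b
  WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.id = b.id);
END //
DELIMITER ;

CREATE PIPELINE testing_sp
AS LOAD DATA S3 'test-bucket/mypath/*'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO PROCEDURE dedupe_insert
FIELDS TERMINATED BY ',';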
For pipeline source, you have a couple options.
ex1:
CREATE PIPELINE testing1
AS LOAD DATA S3 'test-bucket/mypath/file.csv'
ex2:
CREATE PIPELINE testing2
AS LOAD DATA S3 'test-bucket/mypath*'
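Putting that together with the WHERE clause idea above, a fuller statement might look roughly like this (the table, columns, credentials, and format clause are placeholders to adapt to your files):

CREATE PIPELINE testing3
AS LOAD DATA S3 'test-bucket/mypath/file.csv'
CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
INTO TABLE target_table
FIELDS TERMINATED BY ','
(id, payload)
WHERE id IS NOT NULL;  -- example of filtering rows at load time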
Isn't there a way to specify these things as parameters? Suppose we have 10 clients, and each client has 20 databases.
All of these 200 (10 * 20) tables will have different structures, so to load data for these tables we need to create 200 pipelines, as one pipeline can serve only one table.
My question is: isn't there a way to specify the table name and columns (or mapping) dynamically? If not, how many pipelines can be created, and how much space or memory do they consume?