How to Truncate-Load Data from a Pipeline

vshesha · March 13, 2020, 12:30am

I am trying to use a pipeline to load data from an S3 bucket.

The bucket holds output data that I generate, and I want to truncate-load it on a weekly interval.

What is the best way to grab all the data from S3 and overwrite the data in my MemSQL table?

Cheers!

JoYo · March 13, 2020, 1:06am

Why, TRUNCATE TABLE, of course! You’ll need to coordinate with your app somehow, probably like this:

1. Stop the pipeline.
2. Empty the S3 bucket.
3. Truncate the table.
4. Restart the pipeline.
5. Repopulate the S3 bucket.
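
As a rough sketch, the database side of those steps might look like the following. The names `weekly_load` (pipeline) and `weekly_data` (table) are hypothetical placeholders, and the two S3 steps happen outside the database, with whatever tool writes to the bucket.

```sql
-- Hypothetical names: pipeline weekly_load, table weekly_data.

-- 1. Stop the pipeline so nothing is ingested mid-refresh.
STOP PIPELINE weekly_load;

-- 2. (Outside SQL) empty the S3 bucket.

-- 3. Remove all existing rows from the target table.
TRUNCATE TABLE weekly_data;

-- 4. Restart the pipeline so it resumes watching the bucket.
START PIPELINE weekly_load;

-- 5. (Outside SQL) repopulate the S3 bucket.
```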

vshesha · March 13, 2020, 4:50pm

Hello JoYo,

Thank you for the quick response! I tried that yesterday, and the issue is with what I am seeing in “information_schema.pipeline_files”.

When I write my data to S3, I overwrite the existing files, which have names similar to:
0000_part_00
0001_part_00
0002_part_00

and so forth.

When I truncate the table, those files are still shown as “loaded”.

Is there a way to get the pipeline to grab the files again and overwrite the table?
Do I need to delete this pipeline's rows in information_schema.pipeline_files, or can I unload those files from that table so that the pipeline accepts the same files again?

I think it is important for me to state my goal: I want to overwrite the same S3 files with updated data every week. Once the S3 bucket has been loaded with new data, I want to start a pipeline that grabs the updated data and overwrites the data in the table.
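
For reference, a query along these lines is one way to see what the pipeline has recorded for each file. This is only a sketch: `weekly_load` is a hypothetical pipeline name, and the view and column names (`PIPELINES_FILES`, `FILE_NAME`, `FILE_STATE`) follow the MemSQL documentation as I understand it, so check them against your version.

```sql
-- Hypothetical pipeline name: weekly_load.
-- Lists every file the pipeline knows about, with its current state
-- (for example, whether it is marked as loaded).
SELECT FILE_NAME, FILE_STATE
FROM information_schema.PIPELINES_FILES
WHERE PIPELINE_NAME = 'weekly_load'
ORDER BY FILE_NAME;
```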

Thanks again!

vshesha · March 17, 2020, 12:08am

Bump in case this got lost over the weekend.

@JoYo - Any thoughts/suggestions?

rnarayanan · March 19, 2020, 3:41am

Hi Vishesh,

JoYo should be responding to your specific question shortly.
In the meantime, I wanted to ask how exactly you deployed MemSQL. In case you didn't know, MemSQL is available in AWS Marketplace as well as AWS QuickStart. You can spin up either a BYOL listing or the paid (on-demand) listing.

Thanks for your patience
Ramesh Narayanan

JoYo · March 19, 2020, 3:41pm

I see. If you want to start a pipeline from the beginning, you can do:

alter pipeline <name> set offsets earliest

If you want to tell a pipeline to reload a specific file, you can do:

alter pipeline <name> drop file <file>

There is no way to do this automatically: if a file changes, the pipeline won’t detect it.
If you’re completely rewriting your bucket, it’s probably best to just set the offsets to earliest.
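
Putting the pieces of this thread together, a weekly truncate-load could be sketched as below. The pipeline and table names (`weekly_load`, `weekly_data`) are hypothetical, the file name is taken from the example earlier in the thread, and stopping the pipeline before altering it is a cautious assumption rather than a confirmed requirement.

```sql
-- Hypothetical names: pipeline weekly_load, table weekly_data.
-- Run this after the S3 bucket has been rewritten with the new data.

-- Pause ingestion while the table and offsets are reset.
STOP PIPELINE weekly_load;

-- Discard last week's rows.
TRUNCATE TABLE weekly_data;

-- Forget which files were already loaded, so every file is read again.
ALTER PIPELINE weekly_load SET OFFSETS EARLIEST;

-- Alternative for a single file: make the pipeline forget just that file.
-- Copy the exact file name from the pipeline's file metadata.
-- ALTER PIPELINE weekly_load DROP FILE '0000_part_00';

-- Resume ingestion; the rewritten files are loaded from scratch.
START PIPELINE weekly_load;
```

Since the pipeline does not detect changed files on its own, something outside the database (a scheduler or the job that writes the bucket) would need to run these statements each week.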

vshesha · March 19, 2020, 4:09pm

Hello @JoYo,

Thank you for your response!! I will give that a try.

Appreciate you getting back to me!! Have a great day.

Vishesh

vshesha · March 19, 2020, 4:09pm

Hello @rnarayanan,

I used a CloudFormation template on AWS (well, my coworker did).

Cheers,
Vishesh

JCM · June 12, 2024, 4:56pm

Is it possible to do this with multiple files, for example by using wildcards?