Hi,
I want to load data to a MemSQL Table, using a pipeline, from a HDFS folder.
Each day one or two csv files are generated in this folder, but not all the files have the same schema, for example, one file might have 12 columns and other 6 and other 18 (it depends on what happened that day).
Is there any way to load only the columns I am interested? (3 columns that always appear in the csv file)
For example, csv file1 has:
"code","number","year","C"
"blab","15885","2016","Y"
"aeea","15883","1982","E"
"xpto","15884","1986","B"
csv file2 has:
"code","number","year","C","M","type"
"geom","48956","1991","C","A","1"
"goog","77895","1999","E","R,"2""
"ecua","22354","1988","B","G","1"
Then I created the following table:
CREATE TABLE my_table (
code VARCHAR(10),
number INT,
);
And I only want to load data from the first two columns of each csv file, so I create a pipeline:
CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://path/output_folder/'
INTO TABLE `my_table`
FIELDS TERMINATED BY ','
(code, number);
But gives me error, as the csv file contains more columns than available columns in the table (of course)… is there any way to tell the pipeline to only load this two specific columns and ignore any other? (Take into consideration that each csv file may contain different column names each time)
Thanks!!