I am very new to MemSQL.
I am getting the error "ERROR 1933 ER_EXTRACTOR_EXTRACTOR_GET_LATEST_OFFSETS: Cannot get source metadata for pipeline. could not walk folder /landingzone/memsql: stat /landingzone/memsql: permission denied".
Here is the syntax I executed in MemSQL Studio.
CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS 'hdfs://my-name-node:8020/landingzone/memsql/test_tbl'
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t';
/landingzone/memsql/test_tbl has one text file with fields separated by "\t".
Here are my questions.
Do I need to configure anything on the HDFS server for MemSQL to work?
What user does MemSQL use in the statement above to extract data from HDFS?
How does MemSQL understand the delimiter in the HDFS files?
How does MemSQL read metadata from just the HDFS files?
I think you only need to provide the correct user to talk to HDFS. You can add it to the CREATE PIPELINE statement.
...
CREDENTIALS {"user":"<username>"}
INTO TABLE my_database.test_tbl
...
By default, MemSQL will use "user":"memsql".
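For reference, the full statement with the CREDENTIALS clause in place would look roughly like this (a sketch using the path and table from your post; the username is a placeholder):
CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS 'hdfs://my-name-node:8020/landingzone/memsql/test_tbl'
CREDENTIALS '{"user":"<username>"}'
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t';
Whatever user you pass here needs read permission on the files and execute permission on the directories under /landingzone/memsql in HDFS, which is what the "permission denied" in your error points at.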
All pipeline statements use the following syntax for specifying delimiters (an example follows the grammar below); if the file is gzipped, it will be unzipped first and then parsed with the delimiter options. See SingleStoreDB Cloud · SingleStore Documentation.
[{FIELDS | COLUMNS}
TERMINATED BY 'string'
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
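For example, a tab-separated file with optionally double-quoted fields and newline-terminated rows could be described like this (just a sketch; adjust the characters to match your files):
...
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n';
...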
MemSQL fetches the file metadata from the HDFS namenode, and the files are then downloaded to the leaf nodes from the HDFS datanodes.
Yes, most HDFS operations start by talking to the namenode and only afterward connect to the datanodes.
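If the extractor fails at that metadata step, as with your permission error, the errors are also recorded where you can query them from SQL; for example, assuming the pipeline name from your post:
SELECT * FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'test_pipeline';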
Thanks @m_k. I tried the statement below but I got the same permission error. Could this be because my HDFS is set up with Kerberos authentication? If yes, is there any easy way other than following the steps mentioned in SingleStoreDB Cloud · SingleStore Documentation?
Also, can we directly copy a table from a Teradata database to MemSQL?
CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS 'hdfs://my-name-node:8020/landingzone/memsql/test_tbl'
CREDENTIALS '{"user":"<my_user_nm>"}'
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t';
I see. For Kerberos support, you'll additionally need to configure the cluster to use a JRE and krb5-client, and turn on advanced_hdfs_pipelines. The steps are outlined here:
Don't skip the "Authenticating with Kerberos" section further down the page.
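Before re-creating the pipeline you can check whether the variable is already on; assuming it is exposed like the other engine variables on your version:
SHOW GLOBAL VARIABLES LIKE 'advanced_hdfs_pipelines';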
I am trying to import data from HDFS stored in ORC file format. I don't have any problem importing data from text-based files in HDFS external tables, but with ORC files I got this error:
Row 1 doesn’t contain data for all columns in Load process.
Are ORC-compressed files in HDFS tables also supported as a data source for pipelines?
Is there any special configuration/parameter for those?