I am very new to MemSQL.
I am getting the error "ERROR 1933 ER_EXTRACTOR_EXTRACTOR_GET_LATEST_OFFSETS: Cannot get source metadata for pipeline. could not walk folder /landingzone/memsql: stat /landingzone/memsql: permission denied".
Here is the syntax I executed in MemSQL Studio.
CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS 'hdfs://my-name-node:8020/landingzone/memsql/test_tbl'
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t';
/landingzone/memsql/test_tbl has one text file with fields separated by "\t".
Here are my questions.
Do I need to configure anything on the HDFS server for MemSQL to work?
What user does MemSQL use in the statement above to extract data from HDFS?
How does MemSQL understand the delimiter in the HDFS files?
How does MemSQL read metadata from just the HDFS files?
I think you only need to provide the correct user to talk to HDFS. You can add it to the CREATE PIPELINE statement.
...
CREDENTIALS {"user":"<username>"}
INTO TABLE my_database.test_tbl
...
By default, MemSQL will use "user":"memsql".
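For reference, the full statement with the CREDENTIALS clause in place would look roughly like this (a sketch using the path and table from your post; the username is a placeholder):
CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS 'hdfs://my-name-node:8020/landingzone/memsql/test_tbl'
CREDENTIALS '{"user":"<username>"}'
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t';
Whatever user you pass here needs read permission on the files and execute permission on the directories under /landingzone/memsql in HDFS, which is what the "permission denied" in your error points at.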
All pipeline statements use the following syntax for specifying delimiters (an example follows the grammar below); if the file is gzipped, it will be unzipped first and then parsed with the delimiter options. See SingleStoreDB Cloud · SingleStore Documentation.
[{FIELDS | COLUMNS}
TERMINATED BY 'string'
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
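For example, a tab-separated file with optionally double-quoted fields and newline-terminated rows could be described like this (just a sketch; adjust the characters to match your files):
...
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n';
...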
MemSQL fetches the file metadata from the HDFS namenode, and the files are then downloaded to the leaf nodes from the HDFS datanodes.
Yes, most HDFS operations start by talking to the namenode and only afterward connect to the datanodes.
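If the extractor fails at that metadata step, as with your permission error, the errors are also recorded where you can query them from SQL; for example, assuming the pipeline name from your post:
SELECT * FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'test_pipeline';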
Thanks @m_k. I tried the statement below but I got the same permission error. Could this be because my HDFS is set up with Kerberos authentication? If yes, is there any easy way other than following the steps mentioned in SingleStoreDB Cloud · SingleStore Documentation?
Also, can we directly copy a table from a Teradata database to MemSQL?
CREATE PIPELINE my_database.test_pipeline
AS LOAD DATA HDFS 'hdfs://my-name-node:8020/landingzone/memsql/test_tbl'
CREDENTIALS '{"user":"<my_user_nm>"}'
INTO TABLE my_database.test_tbl
FIELDS TERMINATED BY '\t';
I see. For Kerberos support, you'll additionally need to configure the cluster to use a JRE and krb5-client, and turn on advanced_hdfs_pipelines. The steps are outlined here:
Don't skip the "Authenticating with Kerberos" section further down the page.
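Before re-creating the pipeline you can check whether the variable is already on; assuming it is exposed like the other engine variables on your version:
SHOW GLOBAL VARIABLES LIKE 'advanced_hdfs_pipelines';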
I am trying to import data from HDFS stored in ORC file format. I don't have any problem importing data from text-based files in HDFS external tables, but with ORC files I got this error:
Row 1 doesn’t contain data for all columns in Load process.
Are ORC-compressed files in HDFS tables also supported as a data source for pipelines?
Is there any special configuration/parameter for those?