SingleStore offers a native capability called SingleStore Pipelines to ingest data at high speed. Pipelines load data from a variety of sources, like S3 and Kafka, in a variety of formats including CSV, JSON, Avro and Parquet.
Pipelines are one of the most powerful tools for rapid integration with data sources. One key challenge in using them, however, is that the table schema and pipeline DDL must be defined up front, and then kept up to date through ALTER statements as the data changes.
In the latest 8.7 release of SingleStore, we are happy to introduce automatic schema and pipeline DDL inference, which simplifies the initial setup process so users can start working with their data quickly and reduce time to insights.
Why is this important for customers?
For developers, schema inference means faster setup of new databases and the ability to quickly run queries on their data. The automation minimizes errors while still leaving room to tailor schema and pipeline setups, enhancing developer productivity. By automatically analyzing the structure of the remote file, SingleStore generates the necessary table schema and pipeline definition, recognizing column names, data types and file formats.
The generated definitions can then be applied as is, or modified before they are executed. The input_configuration may be a configuration for loading from Apache Kafka, Amazon S3, a local filesystem, Microsoft Azure, HDFS or Google Cloud Storage. For the syntax specification, refer to CREATE PIPELINE.
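For instance, while the examples below use S3, the same statement can point at any supported source. Here is a minimal sketch of a Kafka variant, with a hypothetical broker and topic name, mirroring the standard LOAD DATA KAFKA clauses:

-- Hypothetical Kafka endpoint and topic; substitute your own broker, topic and any required credentials
INFER PIPELINE AS LOAD DATA KAFKA 'kafka-broker.example.com/book-events'
FORMAT AVRO;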
Using schema inference in SingleStore involves simple commands that enable users to connect to data files and generate the schema definitions. Users can choose to review, modify and apply the suggestions provided.
How to use the new functionality
Infer table/pipeline to generate and view schema suggestions
The following demonstrates the INFER TABLE and INFER PIPELINE commands, which generate the table and pipeline DDL. When run, they give users the option to make any modifications before running the generated commands.
INPUT

INFER TABLE AS LOAD DATA S3 's3://data_folder/books.avro'
CONFIG '{"region":"<region_name>"}'
CREDENTIALS '{"aws_access_key_id":"<your_access_key_id>","aws_secret_access_key":"<your_secret_access_key>","aws_session_token":"<your_session_token>"}'
FORMAT AVRO;

OUTPUT

"CREATE TABLE `infer_example_table` (
  `id` int(11) NOT NULL,
  `name` longtext CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `num_pages` int(11) NOT NULL,
  `rating` double NULL,
  `publish_date` bigint(20) NOT NULL)"

INFER PIPELINE AS LOAD DATA {input_configuration}

INPUT

INFER PIPELINE AS LOAD DATA S3 's3://data_folder/books.avro'
CONFIG '{"region":"<region_name>"}'
CREDENTIALS '{"aws_access_key_id":"<your_access_key_id>","aws_secret_access_key":"<your_secret_access_key>","aws_session_token":"<your_session_token>"}'
FORMAT AVRO;

OUTPUT

"CREATE TABLE `infer_example_table` (
  `id` int(11) NOT NULL,
  `name` longtext CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `num_pages` int(11) NOT NULL,
  `rating` double NULL,
  `publish_date` bigint(20) NOT NULL);
CREATE PIPELINE `infer_example_pipeline`
AS LOAD DATA S3 's3://data-folder/books.avro'
CONFIG '{\"region\":\"us-west-2\"}'
CREDENTIALS '{\n \"aws_access_key_id\":\"your_access_key_id\",\n \"aws_secret_access_key\":\"your_secret_access_key\",\n \"aws_session_token\":\"your_session_token\"}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
INTO TABLE `infer_example_table`
FORMAT AVRO
(
  `id` <- `id`,
  `name` <- `name`,
  `num_pages` <- `num_pages`,
  `rating` <- `rating`,
  `publish_date` <- `publish_date`
);"
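Because the output is ordinary DDL, it can be edited before it is executed. Here is a minimal sketch of that workflow, assuming a user wants to add a sort key to the inferred table; the SORT KEY clause is an illustrative change, not part of the inference output. The generated CREATE PIPELINE statement is then run unchanged and the pipeline started:

-- Inferred table with an illustrative SORT KEY added before creation
CREATE TABLE `infer_example_table` (
  `id` int(11) NOT NULL,
  `name` longtext CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `num_pages` int(11) NOT NULL,
  `rating` double NULL,
  `publish_date` bigint(20) NOT NULL,
  SORT KEY (`publish_date`)
);

-- Run the generated CREATE PIPELINE statement as-is, then start ingestion
START PIPELINE `infer_example_pipeline`;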
Create inferred pipeline
This creates the required table and pipeline for loading the data from the source.
CREATE INFERRED PIPELINE books_pipe AS LOAD DATA S3 's3://data_folder/books.avro'
CONFIG '{"region":"<region_name>"}'
CREDENTIALS '{"aws_access_key_id":"<your_access_key_id>","aws_secret_access_key":"<your_secret_access_key>","aws_session_token":"<your_session_token>"}'
FORMAT AVRO;
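From there, the pipeline is started and monitored like any other SingleStore pipeline. A minimal sketch using the standard information_schema pipeline views:

-- Start ingesting data in the background
START PIPELINE books_pipe;

-- Check pipeline state and any ingestion errors
SELECT PIPELINE_NAME, STATE
FROM information_schema.PIPELINES
WHERE PIPELINE_NAME = 'books_pipe';

SELECT *
FROM information_schema.PIPELINES_ERRORS
WHERE PIPELINE_NAME = 'books_pipe';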
What’s next
SingleStore's schema and pipeline inference dramatically cuts down manual setup, accelerating time to insights. This is the first iteration of the feature, and it will continue to expand.
Schema detection is just the beginning of our investments in developer productivity and in simplifying data ingestion. We plan to extend schema inference to additional file types like JSON and Parquet, along with support for source file schema changes and schema evolution. Dive in and experience the power of streamlined data processing firsthand. Create a free workspace, load your data and start your data journey.