Schema and Pipeline Inference for SingleStore

2 min read

Jul 1, 2024

SingleStore offers a native capability called SingleStore Pipelines to ingest data at high speed. Pipelines can load data from a variety of sources, such as S3 and Kafka, and in a variety of formats, including CSV, JSON, Avro and Parquet.

Pipelines are one of the most powerful tools for rapid integration with data sources. However, one key challenge in using them is pre-defining the table schema and pipeline DDL up front, and then keeping both up to date through ALTER statements as the data changes.

In the latest 8.7 release of SingleStore, we are happy to introduce automatic schema and pipeline DDL inference, which simplifies the initial setup process, lets users start leveraging their data quickly and reduces time to insights for customers.

Why is this important for customers?

For developers, schema inference means accelerated setup times for new databases and the ability to quickly run queries on their data. This automation minimizes errors while still leaving the flexibility to tailor schema and pipeline setups, enhancing developer productivity. By automatically analyzing the structure of the remote file, SingleStore generates the necessary table schema and pipeline definition, recognizing column names, data types and file formats without manual effort.

The suggestions can then be applied as is, or modified before being applied and executed. The input_configuration may be a configuration for loading from Apache Kafka, Amazon S3, a local filesystem, Microsoft Azure, HDFS or Google Cloud Storage. For the full syntax specification, refer to CREATE PIPELINE.

Using schema inference in SingleStore involves simple commands that enable users to connect to data files and generate the schema definitions. Users can choose to review, modify and apply the suggestions provided.
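
For example, here is a minimal sketch of inferring a pipeline from a Kafka topic rather than an S3 path, assuming the topic carries Avro-encoded messages. The broker address and topic name are placeholders for illustration, not values from this article.

INFER PIPELINE AS LOAD DATA KAFKA 'kafka-broker.example.com:9092/books-topic' -- placeholder broker/topic
FORMAT AVRO;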

how-to-use-the-new-functionalityHow to use the new functionality

Infer table/pipeline to generate and view schema suggestions

The following demonstrates the SQL commands that generate the table and pipeline definitions. Running them returns the suggested DDL, giving users the option to make modifications before executing the generated commands.

INPUT
INFER TABLE AS LOAD DATA S3 's3://data_folder/books.avro'
CONFIG '{"region":"<region_name>"}'
CREDENTIALS '{
"aws_access_key_id":"<your_access_key_id>",
"aws_secret_access_key":"<your_secret_access_key>",
"aws_session_token":"<your_session_token>"}'
FORMAT AVRO;
OUTPUT
"CREATE TABLE `infer_example_table` (
`id` int(11) NOT NULL,
`name` longtext CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`num_pages` int(11) NOT NULL,
`rating` double NULL,
`publish_date` bigint(20) NOT NULL)"
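
The generated statement can be executed as is, or edited first. As a sketch, here is one way it could be adjusted before creating the table, keeping the inferred columns but adding keys; the SORT KEY and SHARD KEY choices are illustrative assumptions, not part of the inferred output.

CREATE TABLE `infer_example_table` (
`id` int(11) NOT NULL,
`name` longtext CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`num_pages` int(11) NOT NULL,
`rating` double NULL,
`publish_date` bigint(20) NOT NULL,
SORT KEY (`publish_date`), -- illustrative: order scans by publish date
SHARD KEY (`id`));         -- illustrative: distribute rows by id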
Pipeline inference follows the same general form, where input_configuration is any of the supported sources described above:

INFER PIPELINE AS LOAD DATA {input_configuration}

Pipeline inference

INPUT
INFER PIPELINE AS LOAD DATA S3
's3://data_folder/books.avro'
CONFIG '{"region":"<region_name>"}'
CREDENTIALS '{
"aws_access_key_id":"<your_access_key_id>",
"aws_secret_access_key":"<your_secret_access_key>",
"aws_session_token":"<your_session_token>"}'
FORMAT AVRO;
OUTPUT
"CREATE TABLE `infer_example_table` (
`id` int(11) NOT NULL,
`name` longtext CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`num_pages` int(11) NOT NULL,
`rating` double NULL,
`publish_date` bigint(20) NOT NULL);
CREATE PIPELINE `infer_example_pipeline`
AS LOAD DATA S3 's3://data_folder/books.avro'
CONFIG '{\"region\":\"us-west-2\"}'
CREDENTIALS '{\n \"aws_access_key_id\":\"your_access_key_id\",
\n \"aws_secret_access_key\":\"your_secret_access_key\",
\n \"aws_session_token\":\"your_session_token\"}'
BATCH_INTERVAL 2500
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
INTO TABLE `infer_example_table`
FORMAT AVRO(
`id` <- `id`,
`name` <- `name`,
`num_pages` <- `num_pages`,
`rating` <- `rating`,
`publish_date` <- `publish_date`);"
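
Once the generated table and pipeline statements have been reviewed and executed, the pipeline still has to be started. A minimal sketch of starting it and confirming that rows are arriving, using the information_schema.PIPELINES view to check pipeline state:

START PIPELINE `infer_example_pipeline`;

-- confirm the pipeline is running and rows have been loaded
SELECT PIPELINE_NAME, STATE FROM information_schema.PIPELINES;
SELECT COUNT(*) FROM `infer_example_table`;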

Create inferred pipeline

Running CREATE INFERRED PIPELINE creates both the required table and the pipeline for loading the data from the source in a single statement.

CREATE INFERRED PIPELINE books_pipe AS LOAD DATA S3
's3://data_folder/books.avro'
CONFIG '{"region":"<region_name>"}'
CREDENTIALS '{
"aws_access_key_id":"<your_access_key_id>",
"aws_secret_access_key":"<your_secret_access_key>",
"aws_session_token":"<your_session_token>"}'
FORMAT AVRO;
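
Because this statement creates both objects directly, the generated definitions can be inspected and the pipeline started right away. A short sketch of doing so; the table name `books` in the final query is an assumption for illustration, since the inferred table name depends on the source data.

SHOW CREATE PIPELINE books_pipe;
START PIPELINE books_pipe;
SELECT * FROM books LIMIT 5; -- `books` is an assumed inferred table name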

What’s next

SingleStore's schema and pipeline inference dramatically cuts down manual setup, accelerating time to insights. This is the first iteration of schema inference, and support will expand to additional file types like JSON and Parquet, further enhancing its data processing capabilities.

Schema detection is just the beginning of our investments in developer productivity and simplifying data ingestion. We plan to support JSON and Parquet data files with schema inference, along with handling source file schema changes and broader schema evolution and inference capabilities. Dive in and experience the power of streamlined data processing firsthand. Create a free workspace, load your data and start your data journey.

