The era of open data lakehouses is here.
Data lakehouses have gained popularity over the last decade, and the biggest lakehouse vendors have now embraced Apache Iceberg as the de facto standard. We believe this will revolutionize how companies store and query vast amounts of structured and unstructured data. But while these developments are trending in the right direction, you still need complex and costly ETL to power user-facing, real-time applications.
Earlier this year, we announced native data integration services for MySQL and MongoDB® to support high-performance apps built on popular open-source databases. Many SingleStore customers want the same for their massive data lakes.
Announcing: Bi-directional integration for Apache Iceberg
Today, we're excited to announce SingleStore’s integration with Apache Iceberg (in public preview). You can now use SingleStore to seamlessly ingest data from and write back to Iceberg tables — in real time, with no additional tooling required:
- Zero ETL support from Iceberg (public preview). Data changes from Iceberg tables can be ingested directly into SingleStore as soon as they are added to the source tables, ensuring that analytics and applications are powered by the most current data. This eliminates the costs of additional ETL tooling.
- Automatic schema consistency. SingleStore automatically manages schema evolution/changes and updates, ensuring changes in Iceberg tables are seamlessly reflected in SingleStore without requiring manual intervention.
- Performance and scalability. Leveraging SingleStore's high-performance architecture, the zero ETL solution ensures data ingestion is efficient even at scale, maintaining low latency and high throughput.
- Bi-directional integration (private preview). SingleStore supports writing to Iceberg tables, with Glue catalog support, for seamless data sharing.
- Data discovery and metadata management. Powered by catalogs like AWS Glue and Snowflake Data Catalog to begin with, with Polaris and Hive Metastore coming soon.
We’re going beyond simply being able to query Iceberg tables in place. Our goal is to deliver frictionless, bi-directional data sharing between SingleStore and Iceberg tables so you can power low-latency apps and analytics on Snowflake, Cloudera or any open data lake. We plan to make the following capabilities available later this year:
- External Iceberg tables. You will be able to query Iceberg data directly without creating a local copy. This feature allows users to access and analyze data stored in Apache Iceberg tables seamlessly from SingleStore, eliminating the need for data duplication and reducing storage overhead.
- Egress data to Iceberg. General availability of exporting data from SingleStore to Iceberg tables. Transferring processed or transformed data from SingleStore to Iceberg-managed tables enables bi-directional data flow and enhances data management and interoperability.
The benefits of adopting SingleStore’s zero ETL solution for Iceberg extend beyond its technical efficiencies:
- Low-latency operational engine. Delivers super low-latency analytics on lakehouse data.
- Fast ingest from Iceberg data lakes. Eliminates the need to use complex ETL tools for data ingestion, simplifying the architecture.
- Bi-directional data flow. Enables efficient data sharing between SingleStore and Iceberg, with both operating on a common lakehouse.
How to use zero ETL with SingleStore + Iceberg
SingleStore leverages its Pipelines feature and distributed system architecture to efficiently ingest data from Iceberg tables. This zero ETL functionality can be used with our Free Starter Workspace.
Step 1. Head over to singlestore.com to start free.
Step 2. Create Iceberg tables for required data
SingleStore supports multiple well-known Iceberg catalogs, including Glue, REST and Snowflake. Ensure you have an Iceberg table created or migrated using one of these catalogs, with a storage layer like S3. For example, you can create one using spark-sql by following the Iceberg docs, along the lines of the sketch below.
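A minimal spark-sql sketch, assuming a Spark session already configured with an Iceberg catalog backed by S3 (the catalog name demo_catalog, database, table and columns here are illustrative placeholders):

-- Create a demo Iceberg table with a nested address field
CREATE TABLE demo_catalog.db_name.addresses (
    id      BIGINT,
    name    STRING,
    address STRUCT<street: STRING, city: STRING>
) USING iceberg;

-- Add a sample row so there is data to ingest in Step 3
INSERT INTO demo_catalog.db_name.addresses
VALUES (1, 'Alice', named_struct('street', '1 Main St', 'city', 'Portland'));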
Step 3. Create a pipeline to ingest data from your Iceberg tables
The CREATE PIPELINE statement creates a pipeline object that manages connectivity between SingleStore and the Iceberg catalog/storage to pull data from the Iceberg table.
CREATE PIPELINE ... AS
LOAD DATA <data_source>
CONFIG '{<catalog_config>, <file_io_config>}'
CREDENTIALS '{<file_io_creds>}'
INTO ...
FORMAT ICEBERG;
- <catalog_config>: Includes <table_id> and other options to connect to the Iceberg catalog and read table metadata.
- <file_io_config> and <file_io_creds>: Used to connect to the data source or storage layer and read table data, as in existing SingleStore pipelines.
ETL flow
- Aggregator. Handles table metadata requests (reading schema, partitioning info and snapshots) and schedules batches to run.
- Leaf nodes. Process batch partitions to consume Iceberg table data stored in Parquet format and handle schema evolution.
In addition, Iceberg pipelines support all of the data manipulation logic available in existing Parquet pipelines.
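For instance, here is a hedged sketch of the @variable/SET pattern familiar from Parquet pipelines applied to an Iceberg pipeline (the pipeline name and transformation are illustrative, and clause ordering should be checked against the LOAD DATA docs):

CREATE PIPELINE addresses_norm AS
LOAD DATA S3 ''
CONFIG '{<catalog_config>, <file_io_config>}'
CREDENTIALS '{<file_io_creds>}'
INTO TABLE addresses
-- capture the source field into a variable, then transform it with SET
(Id <- id, @name <- name, Street <- address::street, City <- address::city)
SET Name = UPPER(@name)
FORMAT ICEBERG;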
Example:
CREATE PIPELINE addresses_pipe AS
LOAD DATA S3 ''
CONFIG '{"region": "us-west-2",
    "catalog_type": "SNOWFLAKE",
    "table_id": "db_name.schema_name.table_name",
    "catalog.uri": "jdbc:snowflake://<account_identifier>.snowflakecomputing.com",
    "catalog.jdbc.user": "<user_name>",
    "catalog.jdbc.password": "<password>",
    "catalog.jdbc.role": "<user_role>"}'
CREDENTIALS '{"aws_access_key_id": "<your_access_key_id>",
    "aws_secret_access_key": "<your_secret_access_key>"}'
INTO TABLE addresses
(Id <- id, Name <- name, Street <- address::street, City <- address::city)
FORMAT ICEBERG;
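Once the pipeline is created, start it and confirm rows are arriving; a quick sketch using standard pipeline commands:

START PIPELINE addresses_pipe;

-- Verify that ingested rows have landed in the destination table
SELECT * FROM addresses LIMIT 10;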
For detailed configuration steps and examples, refer to our Iceberg ingest documentation.
Take your first step today
SingleStore’s integration with Apache Iceberg marks a significant milestone in simplifying data integration and management. It’s the simplest way to connect SingleStore to your Iceberg data for optimal speed and performance.
--> Watch the demo
--> Read our documentation
--> Stay tuned for more Iceberg related product updates