SingleStore Aura Container Service: A Serverless Platform Built for AI

8 min read

Apr 8, 2025

Last August, we quietly rolled out the Aura Container Service.

It powers our notebooks and Scheduled Jobs, replacing the Jupyter servers we previously ran on AWS EC2 instances with our own custom-built platform: a more flexible, optimized solution that supports workloads with different computation requirements. Over the past few weeks we have verified that our availability targets are consistently being met, so we can begin moving toward guaranteed SLAs for customers using Aura to power mission-critical workloads.

We initially envisioned Aura Container Service as a general-purpose compute service that can run any containerized application in a serverless fashion, taking advantage of the speed and scale SingleStore offers. The first set of applications we started with is our interactive notebooks, which expand on SingleStore’s core capabilities by allowing analysts to write, organize and run not only Python code but also SQL queries.

How it works

Aura works by maintaining a warm pool of containers that the platform can assign on demand to any requester in under a second. These containers come with certain libraries pre-installed, so users can jump right into coding.
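
To illustrate the pattern only (this is not our actual implementation), here is a minimal, hypothetical Python sketch of a warm-pool allocator: containers are provisioned ahead of time, an incoming request simply claims one, and a background refill keeps the pool warm.

```python
import queue
import threading

class WarmPool:
    """Hypothetical warm pool: pre-provisioned containers are handed out instantly."""

    def __init__(self, provision_container, target_size=10):
        self._provision = provision_container    # slow call that builds a container
        self._pool = queue.Queue()
        for _ in range(target_size):
            self._pool.put(self._provision())    # pay the startup cost up front

    def acquire(self):
        container = self._pool.get()             # sub-second: just a queue pop
        # Refill asynchronously so the pool stays warm for the next request.
        threading.Thread(target=lambda: self._pool.put(self._provision())).start()
        return container

# Usage with a stand-in provisioner:
pool = WarmPool(provision_container=lambda: {"status": "ready"}, target_size=4)
session = pool.acquire()   # returned immediately from the warm pool
```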

When using SingleStore Notebooks, you don’t have to worry about manually setting up authentication to connect to your SingleStore database. Notebooks automatically use the logged-in user’s JWT to access the selected database, so authentication and authorization are handled without any manual configuration.
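
As a rough sketch (assuming the notebook environment injects the connection details for the selected workspace, including the user’s JWT), connecting from a notebook cell can be as simple as:

```python
import singlestoredb as s2

# Inside a SingleStore notebook, the connection details for the selected
# database are injected into the environment, so no credentials are passed here.
conn = s2.connect()

with conn.cursor() as cur:
    cur.execute("SELECT NOW()")
    print(cur.fetchone())

conn.close()
```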

We also introduced SingleStore Fusion SQL commands: SQL statements that can be used to manage workspace groups, workspaces, files in the workspace STAGE and other resources that previously could only be managed through the portal user interface or the Management REST API. The SingleStore Python client intercepts all SQL statements and either forwards them to the database or converts them into Management REST API requests accordingly. The same operations available via Fusion SQL are also available via our Python SDK, which simplifies the usage of the Management REST API in Python code.
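
For example, here is a minimal sketch of the same workspace listing done both ways, assuming Fusion SQL is enabled in the client (as it is in SingleStore notebooks) and that a Management API key is available for the SDK; treat the attribute names as illustrative rather than exhaustive.

```python
import singlestoredb as s2

# Fusion SQL: the Python client intercepts this statement and translates it
# into a Management REST API call instead of sending it to the database.
conn = s2.connect()
with conn.cursor() as cur:
    cur.execute("SHOW WORKSPACE GROUPS")
    for row in cur.fetchall():
        print(row)
conn.close()

# Python SDK: the same operation through the workspace manager
# (the access token below is a placeholder).
manager = s2.manage_workspaces(access_token="YOUR_MANAGEMENT_API_KEY")
for group in manager.workspace_groups:
    print(group.name)
```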

Our engineering teams have been consistently working on making this experience as seamless as possible — you can read more about it in our engineering blog.

We have seen steady growth, nearly 9x, in usage of the Aura Container Service powering notebook sessions.

What are we doing now, and how does it benefit our users?

In addition to the cost savings from the serverless nature of usage, Aura Container Service offers several significant benefits for end users:

One of the most notable benefits is the elimination of latency issues commonly associated with traditional container deployment. By maintaining a pool of pre-warmed containers, Aura enables sub-second container acquisition times, allowing users to begin working immediately without the traditional wait times for container startup and initialization. This instant availability dramatically improves workflow efficiency — especially for data scientists and analysts who need quick access to computational resources.

We are adding support for Aura runtimes in multiple regions. This is particularly important for European startups that need to maintain compliance with EU data laws, while leveraging the benefits of serverless computing.

Having multiple regions also ensures there is always a region to switch traffic over to, providing uninterrupted service even during regional outages. This multi-region architecture creates a robust failover system that maintains high availability for mission-critical workloads, giving users confidence their applications will remain accessible regardless of isolated infrastructure issues.

By design, Aura delivers runtime security: we provide complete compute and network isolation when we host your workloads in an Aura runtime.

Where is the market headed?

With all this in place, we wanted to give users a way to take advantage of the serverless nature of Aura: a quick, easy way to write custom code in SingleStore Aura Notebooks and deploy the functions defined there at the click of a button. Cloud functions offer a serverless, cloud-based service that lets users run code without managing compute resources, and they provide programmatic access not just to SingleStore, but to any data source of your choice. We expect this to play a crucial role in building functions that can be used as tools for agentic applications.
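
As an illustrative sketch only (the deployment itself is driven from the notebook; FastAPI is used here simply as one way to express an HTTP endpoint, and the route and table names are made up), a cloud function might look like this:

```python
import singlestoredb as s2
from fastapi import FastAPI

app = FastAPI()

@app.get("/orders/count")
def count_orders():
    # The runtime environment supplies the database connection details,
    # so the function can query SingleStore without embedding credentials.
    conn = s2.connect()
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders")  # hypothetical table
        (count,) = cur.fetchone()
    conn.close()
    return {"order_count": count}
```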

We envision most agentic application workflows falling into two broad categories. The first is knowledge-based workflows, where the interaction directs the agent to fetch relevant information and either communicate it to the user in a ChatGPT-esque interface or trigger an action, like updating a file. The second is action-oriented workflows, where an LLM equipped with tools and knowledge carries out a series of tasks, each based on the results of the previous step, to achieve a specific objective: web scraping, a SQL agent, a sales agent and so on.

But what we have come to realize is that all these agentic applications need:

  • Foundational Large Language Models (LLMs), a commodity in the current market, that serve as the brains of the application, plus the GPU/TPU compute to run them on
  • A fast database as the knowledge store for Retrieval Augmented Generation (RAG)
  • A general-purpose compute layer where all the tools and connectors for data sources run
  • Tools and connectors with deterministic, programmatic logic that extend the capabilities of LLMs (see the sketch after this list)
  • Orchestration tooling, including everything related to developer experience, that makes it easier to leverage all of the above
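
As a minimal, hypothetical sketch of the "tools and connectors" piece: a tool is often just a plain function with deterministic logic that an agent can be instructed to call. Here, one that runs a parameterized SQL query against SingleStore (the table and column names are made up).

```python
import singlestoredb as s2

def top_customers_by_spend(limit: int = 5) -> list[dict]:
    """Deterministic tool an agent can call: no model involved, just SQL."""
    conn = s2.connect()
    with conn.cursor() as cur:
        cur.execute(
            "SELECT customer_id, SUM(amount) AS total "
            "FROM purchases GROUP BY customer_id "
            "ORDER BY total DESC LIMIT %s",
            (limit,),
        )
        rows = cur.fetchall()
    conn.close()
    return [{"customer_id": cid, "total": float(total)} for cid, total in rows]
```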

Lessons learned and future directions

The open-source revolution

Open source is undeniably the way forward in the AI landscape. Models like Llama, Mistral and DeepSeek ensure everyone can benefit from better models. DeepSeek's recent impact on the U.S. stock market shows that knowing how to build and tune foundational LLMs offers no moat when a scrappy startup can emerge from nowhere, release an open-source model for anyone to run and match some of the frontier models in the industry.

This democratization of AI technology is reshaping how businesses approach AI integration, making powerful capabilities accessible to organizations of all sizes. 

Our plans for Aura include supporting heavier compute and GPU instances that will enable our platform to run these models natively. Currently, we're running quantized versions to iron out authentication and performance issues, but our roadmap includes expanding to support the full range of model sizes and types. GPU support is critical as AI workloads continue to demand more computational power.

Our pre-configured container images come with NVIDIA drivers pre-installed, making it seamless to run GPU-accelerated workloads without additional setup — allowing developers to focus on their code rather than infrastructure concerns.
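
For instance, here is a small sketch (assuming PyTorch is present in the image or installed by the user) that confirms the GPU is visible before running accelerated workloads:

```python
import torch

# On a GPU-enabled runtime with the NVIDIA driver preinstalled,
# this should report at least one CUDA device.
if torch.cuda.is_available():
    print(f"{torch.cuda.device_count()} GPU(s) available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; falling back to CPU.")
```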

Server(less?) optimizations

While Aura is designed for serverless runtime allocation with sub-second container provisioning, workloads that pull in heavy dependencies, such as extensive pip installs or other initialization steps inside cloud functions, increase the latency of serving the first request.

To address this, we're working on letting users select and configure the idle timeout of a session. This flexibility lets users balance the cost benefits of serverless with the performance advantages of persistent containers, tailoring the service to their specific workload patterns.

We're also exploring options that let customers bring their own container images and run full, scalable applications on Aura. This lets customers write code in any language and run it with minimal configuration, expanding the versatility of our platform. We are experimenting with hosting a few open-source images to see how they run within our platform, and the results so far have been favorable.

Integrated AI platform

We previously explored running a now-obsolete hosting service within Aura for building embedding functions, but found that ingesting data, whether through SingleStore Pipelines or with large-scale embedding generation, led to bottlenecks, and throughput with large data volumes wasn't sufficient.

We've now rebuilt this hosting service from scratch through our inferenceAPI service, which we expect will overcome these performance issues. Initial results have been promising, showing significant improvements in throughput and latency.

By bringing together LLM models, embedding services and database capabilities on the same platform, we're creating several advantages:

  1. Keeping data within a single environment reduces the attack surface and simplifies security management. Our platform implements robust security isolation for each tenant, allowing safe execution of untrusted code or LLM-generated content.
  2. Eliminating the need to move high-dimensional vectors across different providers significantly reduces latency and improves overall system performance. This is particularly important for vector search applications where milliseconds matter. Delegating vector management to the platform also shortens implementation timelines and ensures our vector search customers can focus on business logic rather than tedious infrastructure configuration.
  3. Having all components in one place streamlines the development process, reducing the complexity of managing multiple services and authentication mechanisms. This integrated approach allows developers to focus on building applications rather than managing infrastructure.

With these benefits in mind, our first improvement aims to provide an embedding management service for our vector search customers. This eliminates the need to move high-dimensional vectors across different providers, and allows customers to bring their own custom embedding and LLM models to run in our VPC.
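
To make the "keep vectors next to the data" point concrete, here is a small illustrative sketch of storing and searching embeddings directly in SingleStore from Python. The table name, vector dimension and stub embedding function are all assumptions, not part of our product.

```python
import json
import singlestoredb as s2

def embed(text: str) -> list[float]:
    # Placeholder standing in for whichever embedding model the platform hosts.
    return [0.0] * 768

conn = s2.connect()
with conn.cursor() as cur:
    # Hypothetical table with a 768-dimensional vector column.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "  id BIGINT PRIMARY KEY, body TEXT, embedding VECTOR(768))"
    )

    # Nearest neighbors by dot product (<*>), computed where the data lives,
    # so no high-dimensional vectors cross provider boundaries.
    cur.execute(
        "SELECT id, body, embedding <*> (%s :> VECTOR(768)) AS score "
        "FROM docs ORDER BY score DESC LIMIT 5",
        (json.dumps(embed("serverless container platforms")),),
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```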

We are exploring several areas where our platform can serve evolving AI needs, and we're excited to push into the cutting edge of the backend and AI industry. Look out for more detailed engineering blogs in the near future, where we delve deeper into all these topics.

