This article describes what streaming data is, its benefits and pitfalls — as well as use cases and the architecture you need to get started.
Table of Contents
- What Is Streaming Data?
- How Does Streaming Data Work?
- The Benefits of Streaming Data
- Use Cases for Streaming Data
- Financial Trading
- Business Analytics
- Security Systems
- Retail and Inventory
- Working with Streaming Data
- Event Sources
- Ingestion Systems
- Stream Processing Systems
- Data Sink
- SingleStoreDB
- Conclusion
What Is Streaming Data?
Streaming data is fast becoming the new standard for data-driven organizations. Instead of being processed in batches, streamed data gets processed at the time of creation.
This paradigm is upending various industries. Streaming data creates new opportunities for IT departments, business users and customers, whether in finance or retail. Staying ahead of the curve requires that you understand how stream processing works.
This article describes what streaming data is, and outlines its benefits and pitfalls. You'll learn about a variety of use cases and about the architecture you need to get started.
How Does Streaming Data Work?
When large volumes of data are involved, organizations tend to process them in batches. Overnight jobs clean, transform, enrich and optimize data that was generated that same day and make it available in a multitude of applications such as CRMs, marketing automation systems, CMSs, analytics platforms and ERPs.
However, because of the delays in decision-making it introduces, batch processing is increasingly considered an anti-pattern. In recent years, cloud computing has made it feasible to provision the memory and compute needed to process data at, or close to, the point of its creation. Data that's processed in real time is known as streaming data, and the workflow that processes it is known as a stream processing pipeline.
Typically, a specific action creates a piece of data known as an event. This event is ingested into a pipeline, where sanity checks, transformations and/or enrichments are applied to it, enabling downstream applications to take action in real time.
Say you own a platform for managing crypto assets. Every time a user logs in, an event is created and processed in a streaming data pipeline. Several things happen in parallel:
- The usage pattern encoded in the event is scored by an algorithm to indicate if potential fraud is involved. If the event passes the test, the server validates the login.
- The event is streamed into the data warehouse for analytical purposes.
- The marketing automation platform is triggered to send an email and push notification to the user, informing them that a new device has attempted to log in.
There are no point-to-point integrations involved here. All the processes are triggered by the same event. This architectural design is known as Kappa architecture; the sketch below illustrates the fan-out. The following section outlines the benefits of this approach.
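Here is a minimal, in-process sketch of that fan-out in Python. The in-memory queue stands in for a durable event broker such as Kafka, and the handler names and fraud rule are illustrative assumptions, not a real implementation.

```python
import queue

# In-memory stand-in for a durable event broker.
events = queue.Queue()

def score_for_fraud(event):
    """Placeholder rule; a real system would call a trained model."""
    return event["failed_attempts"] < 3

def validate_login(event):
    if score_for_fraud(event):
        print(f"login validated for user {event['user_id']}")

def stream_to_warehouse(event):
    print(f"warehouse append: {event}")

def notify_user(event):
    print(f"notification sent to user {event['user_id']}")

# Every consumer sees the same event; no point-to-point wiring.
handlers = [validate_login, stream_to_warehouse, notify_user]

events.put({"user_id": 42, "device": "new-laptop", "failed_attempts": 0})

while not events.empty():
    event = events.get()
    for handler in handlers:
        handler(event)
```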
The Benefits of Streaming Data
Here are the main advantages of working with streaming data:
- Customers get real-time feedback: Because data is processed in real time, customers or users don't have to wait for information to arrive. Their actions get instant feedback, often resulting in higher customer satisfaction.
- Analysts can access data in real time: An organization's internal customers also stand to gain. Data analysts and scientists get to work with the latest data, so their insights are always up-to-date.
- Business users can act more quickly: It's not only data professionals who are impacted by streaming data. When data is processed and visualized adequately, managers can make decisions while events are still unfolding.
- Reduced memory utilization: The IT department will also be happy. In the past, large clusters had to be set up to process massive batches of data. When set up adequately, streaming architectures can drastically reduce the need for ever-growing memory utilization.
- No spaghetti architecture: In a traditional setup of point-to-point integrations, the number of connections grows quadratically with the number of systems (ten systems can require up to 45 integrations), while in Kappa architecture it grows linearly (ten systems need only ten connections to the event stream). Consequently, organizations can drastically reduce the operational expenditure of maintaining integrations in their tech stack.
Use Cases for Streaming Data
Now that you understand the various benefits of streaming data, it's time to move to use cases. This section presents four typical use cases for implementing Kappa architecture.
Financial Trading
In no sector do speed and synchronization matter more than in financial trading. Traders are not only always on the lookout for arbitrage opportunities; they also have to keep all their systems in sync. This ensures that they get constant feedback on their buying and selling behavior and that their terminals are up-to-date.
Business Analytics
Real-time analytics benefit numerous business sectors, as detecting operational difficulties early has a direct impact on the bottom line. But real-time data can also be employed in more creative ways. When a CRM is plugged directly into a data stream, sales representatives can know exactly what their clients have been looking at online.
Security Systems
Cybersecurity is another sector where quick action is crucial. Instead of waiting for recurring manual or automated tests to finish, organizations can stream security events and monitor them continuously through metrics. When these metrics deviate too strongly from a baseline, automated actions can be triggered to contain a possible breach.
Retail and Inventory
Many retail chains calculate stocks recurrently. What if inventory could be checked and refilled in real time? That's where data streams come in. ERP systems can listen to a data stream to monitor all incoming and outgoing goods and take action automatically: they can adjust prices, place an outgoing order, remove the product from the website, etc.
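As a toy illustration, here is what an event-driven inventory listener might look like in Python. The SKU, threshold and reorder action are made-up assumptions; a real ERP would consume these events from a broker rather than a hardcoded list.

```python
# Illustrative only: SKU, reorder point and starting stock are made up.
REORDER_POINT = 10
stock = {"sku-123": 12}

def place_outgoing_order(sku):
    print(f"reorder triggered for {sku}; stock is down to {stock[sku]}")

def on_stock_event(event):
    """React to each incoming/outgoing goods event as it arrives."""
    stock[event["sku"]] = stock.get(event["sku"], 0) + event["delta"]
    if stock[event["sku"]] <= REORDER_POINT:
        place_outgoing_order(event["sku"])

# In production these events would arrive from a stream, not a list.
for event in [{"sku": "sku-123", "delta": -1}, {"sku": "sku-123", "delta": -2}]:
    on_stock_event(event)
```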
Working with Streaming Data
The following sections outline the fundamental differences between streaming and batch data, and how the distinct components of a streaming architecture cope with each.
Event Sources
Many systems produce a continuous and never-ending stream of data. While there might be a beginning to the stream, it hasn't ended when you're processing it. This stream of events is also known as unbounded data. This is in contrast to bounded data, which has a beginning and an end.
For example, if you're loading all medal winners from the 2016 Summer Olympics, you know you'll load a table of 972 rows. But keep in mind that most data out there is unbounded. The way humans divide it into chunks is often arbitrary: even in our example, the IOC awards new Olympic medals every two years.
This illustrates how working with streaming data is not only technically different but also requires a different mindset. One can't think about data in batches when data is constantly being produced, processed and stored.
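The distinction is easy to demonstrate in Python: a list is bounded, while a generator can model an unbounded stream that you can only ever sample a window of. The sensor readings here are a made-up stand-in for any event source.

```python
import itertools
import random

# Bounded: a finite table you can load and count.
medal_winners = ["row-%d" % i for i in range(972)]
print(len(medal_winners))  # 972 - this data set has an end

# Unbounded: a generator that never stops producing events.
def sensor_readings():
    while True:
        yield {"temperature": random.gauss(20, 2)}

# You can only ever inspect a slice (window) of an unbounded stream.
for reading in itertools.islice(sensor_readings(), 5):
    print(reading)
```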
Ingestion Systems
Ingestion systems are middleware built to manage high volumes of events. They capture and store messages in order, for a configurable retention period, in a distributed log. This temporary storage enables rewinding or message replay: when something breaks downstream, the whole stream can be reprocessed or reanalyzed, which is especially important for big data use cases.
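As a minimal sketch of this pattern, here is how producing and replaying events might look with the kafka-python client, assuming a broker is running on localhost:9092 and using an illustrative topic name. Other brokers and clients follow the same shape.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Produce a JSON-encoded event to an illustrative "logins" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("logins", {"user_id": 42, "device": "new-laptop"})
producer.flush()

# "earliest" replays the topic from the start of its retention window,
# which is what makes reprocessing after a downstream failure possible.
consumer = KafkaConsumer(
    "logins",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:  # blocks, waiting for new events
    print(message.offset, message.value)
```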
Stream Processing Systems
A stream processing system, or stream analytics system, can run queries on continuous streams of data, often through windows. Three types of windowing systems are relevant, as shown in the sketch after this list:
- Tumbling windows: These are partitions of non-overlapping chunks of data of the same length.
- Sliding windows: The windows are of fixed length but are slid over the stream at an interval, usually smaller than the window size. The result is that events can belong to multiple windows.
- Session windows: These windows vary in length: a window stays open as long as events keep arriving and closes after a period of inactivity. This makes them a natural fit for analyzing user sessions.
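The three window types are easy to demonstrate on a small batch of timestamped events. The data, window sizes and inactivity gap below are made up; a real stream processor applies the same logic continuously rather than over a finished list.

```python
# Timestamped events as (seconds, value) pairs; toy data.
events = [(0, 5), (1, 3), (2, 7), (8, 1), (9, 9), (15, 2)]

def tumbling(events, size):
    """Non-overlapping windows: each event lands in exactly one."""
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // size, []).append(value)
    return windows

def sliding(events, size, step):
    """Overlapping windows: one event can belong to several."""
    end = max(ts for ts, _ in events)
    windows, start = [], 0
    while start <= end:
        windows.append([v for ts, v in events if start <= ts < start + size])
        start += step
    return windows

def session(events, gap):
    """Variable-length windows: a window closes after `gap` seconds of inactivity."""
    windows, current, last_ts = [], [], None
    for ts, value in sorted(events):
        if last_ts is not None and ts - last_ts > gap:
            windows.append(current)
            current = []
        current.append(value)
        last_ts = ts
    windows.append(current)
    return windows

print(tumbling(events, size=5))        # {0: [5, 3, 7], 1: [1, 9], 3: [2]}
print(sliding(events, size=5, step=2))
print(session(events, gap=3))          # [[5, 3, 7], [1, 9], [2]]
```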
Processing data streams at sub-second latency from ingestion allows anomalies and patterns to be detected almost as soon as they occur.
Data Sink
Finally, most organizations permanently store the data streams for reporting, analytical, predictive and even operational use cases. For cold storage, object storage or data lakes are used. For analytical workloads, data warehouses tend to be more suitable. And for operational workloads, relational systems are most suitable. For some use cases, being able to query and aggregate data at rest and data in flight together is relevant.
That's where the streaming database comes in.
Streaming databases are quite different from traditional relational database management systems (RDBMSs). A traditional database first ingests data, which can then be queried. A streaming database bridges two software categories, the stream processing system and the data sink: it ingests streaming data and immediately uses it to update the results of any registered queries or views.
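A toy sketch of the idea in Python: every ingested event immediately updates the result of each registered query, so reads never wait for a batch job. The class and method names are illustrative, not any vendor's API; real streaming databases express registered queries in SQL.

```python
class StreamingDB:
    """Toy model: registered queries are folded over events on ingest."""

    def __init__(self):
        self.views = {}  # name -> [reduce_fn, running_state]

    def register_query(self, name, reduce_fn, initial):
        self.views[name] = [reduce_fn, initial]

    def ingest(self, event):
        # Each event immediately updates every registered view.
        for view in self.views.values():
            view[1] = view[0](view[1], event)

    def result(self, name):
        return self.views[name][1]

db = StreamingDB()
db.register_query("login_count", lambda acc, e: acc + 1, 0)
db.ingest({"user_id": 42, "action": "login"})
db.ingest({"user_id": 7, "action": "login"})
print(db.result("login_count"))  # 2 - always current, no batch job
```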
SingleStoreDB
SingleStoreDB is a hybrid transactional/analytical processing (HTAP) system that can run self-managed or as a DBaaS, and is well suited to handling streaming data and analytics.
Like a data warehouse, it stores data in a columnar format. However, it also has in-memory row-based storage that acts as a lock-free skip list in front of the columnstore. This skip list stores events from the data stream until it contains enough of them to fill an entire segment. For read queries, the skip list is just another segment, indistinguishable from one stored on disk. This design treats streaming data as a first-class citizen for both operational and analytical workloads.
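Because SingleStoreDB speaks the MySQL wire protocol, a standard client such as PyMySQL can both ingest events and query them immediately. The host, credentials and the `logins` table in this sketch are placeholders you'd adapt to your own deployment.

```python
import pymysql  # SingleStoreDB is MySQL wire-compatible

# Placeholder connection details; adapt to your deployment.
conn = pymysql.connect(host="svc-example.singlestore.com", port=3306,
                       user="admin", password="...", database="events")

with conn.cursor() as cur:
    # New rows land in the in-memory segment described above...
    cur.execute(
        "INSERT INTO logins (user_id, device, ts) VALUES (%s, %s, NOW())",
        (42, "new-laptop"),
    )
    conn.commit()
    # ...and are immediately visible to analytical queries that span
    # the in-memory segment and the on-disk columnstore alike.
    cur.execute("SELECT device, COUNT(*) FROM logins GROUP BY device")
    print(cur.fetchall())
```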
Conclusion
In this article, you've learned what streaming data is and what it's useful for. You should now understand how streaming data is processed for various usage patterns, and the overview of what you need from a database solution should put you on the right track.
If you're looking for a managed database that can handle OLTP and OLAP workloads with streaming data, take a look at SingleStoreDB — and enjoy $600 of free credits.