This article introduces distributed databases, explains why you might want to distribute your database and expands on some of the more relevant ideas in database distribution.
Table of Contents
What Is a Distributed Database?
What Is a Distributed Database System?
Why Distribute Databases?
Performance
Scalability
Reliability
Transparency
Distributed Database Use Cases
What Are the Characteristics of Distributed Databases?
Distributed Data Storage
Distributed Query Execution
Security
Ease of Integration
Cost Optimization
What Is a Distributed Database?
Distribution, or integrating the processing capabilities of several machines to present their functions as if they were a single entity, is quite prevalent in modern computing. You’ve likely interacted with a distributed system and not realized it. For instance, social media applications offer their users a behavioral experience suggestive of the capability of a single powerful computer when in fact, the underlying technology is anything but.
Platforms like Twitter and Facebook run at scale and are built on top of tools that are sharded (fragmented) across several computing elements, often in different locations, and integrated into neat, easily browsable applications. Users are often completely unaware of the underlying architecture.
Databases feature prominently in application development and benefit immensely from distribution. Distributed databases, as opposed to those strictly siloed to a single machine, are scalable, performant and offer data transparency. They effectively eliminate the need for you to know what is happening under the hood. By extension, the applications that are connected to distributed databases provide a better user experience as well.
This article introduces distributed databases, explains why you might want to distribute your database and expands on some of the more relevant ideas in database distribution.
What Is a Distributed Database System?
Distributed databases are essentially distributed systems, where the computing devices or processing elements (PEs) in the system communicate over some network (like the internet) and synchronize to present the functions of several components as a singular function to a user.
Databases generally exist to serve as repositories for data. When distributed, the data they store, along with the artifacts that make storing and retrieving data possible, are replicated across several computers or sites. You can think of distributed databases as collections of logically related data and artifacts located at the processing devices of distributed systems. For instance when a user attempts to retrieve data from a distributed database via some query, the result is a single query-relevant data set that presents a unified view of the data store — even though architecturally, it’s a composite system.
Distributed databases are, without all the fancy replication and fragmentation strategies - databases. Those of the Structured Query Language (SQL) variety like SingleStore atomically distribute several indexes and transactions (artifacts tightly woven into the SQL mold) along with various data retrievable via query.
Indexes are artifacts that generally increase read performance of database queries. They leverage what are known as hashtables, key-value data structures that allow for data access in constant time, to increase the speed and efficiency of database reads. Often invoked behind-the-scenes, the performance impact of indexes can be significant, especially in situations that condition concurrent user access.
Transactions, on the other hand, are units of work (typically sequences of queries) that are executed in such a way that all the actions performed by a single user — each action in the sequence — has a managed cascading effect for all users. Transactions are processed in such a way that every operation conveys to all database users: an update made by one user to say — their username — is rendered visible, upon update to everyone who interacts with that user’s profile.
Users of Facebook can post, like and share on a whim because of its distributed database. Like SingleStore, it’s fragmented across several geographical locations to give each user an experience that presents sole, central access.
Before proceeding, it’s worth mentioning that users interact more directly with distributed database management systems (DDBMS) that convey a database’s underlying distribution to them.
Why Distribute Databases?
Modern single-node systems with centralized data storage were designed to overcome a traditional data dependence between databases and applications, freeing data from the logical and physical constraints of where it’s stored. However, these single-node systems now present a different problem. The centralization in such setups is not a product of combining several node fragments but is instead a pure function of the host machine's processing capabilities. As such, all functions, performance, data sizes and additional scaling are constrained by computational capacities — leading to a quick deterioration, event-to-insight response and rising operational costs.
Though multiple access and upward vertical scaling are technically possible with single-node setups, without distribution, the lack of fault tolerance from having a single corruptible point of failure (the lone system) and the processing limits of a single node can be very restrictive considering the heightened scale of data storage today.
Distributed databases overcome the performance limitations of single-node systems by providing data centralization via the integration of several site-distributed data elements. Though scale and performance are probably the most obvious benefits of distributed systems, there’s also significant value in their reliability and data transparency.
Performance
The improved performance of distributed databases has two dimensions: fragmentation and parallelism. Distributed databases exist as data fragments stored across a network, in sites close to their points of use. So, a retailer with branches in multiple locations will store relevant branch-specific data in a part of the network as close as possible to where the data will be used. The locality of the data — its proximity to the point of use — reduces latency in distributed setups and often translates to faster querying.
Parallelism is a computing model built into the function of distributed databases. Running processes in parallel often means running them together, almost precisely simultaneously (and often in interleaved sequences), to increase the rate at which process-resultant data is generated. When you play a video on Facebook while actively chatting with a friend in an open chat window, you are effectively enjoying some form of parallelism. Distributed databases, on a granular level, run queries in parallel to produce query-relevant data sets and present significantly faster data access to users.
Scalability
Scaling, or more specifically scaling-out, is often associated with performance and is readily achievable with distribution. Also known as horizontal scaling, scaling-out involves increasing the power of a distributed system — in this case, a distributed database — by incrementing its processing capabilities and storage capacity, often through adding performant computing devices. Distributed databases can achieve linear increments in processing power by adding more database servers (with more CPU and disk capacity) to a network. The scaling power here is expressed in the collective processing capabilities of the network.
Reliability
In distributed database systems, availability and reliability are usually guaranteed by replication, which is the necessary duplication of database artifacts at various points within the system. There’s no single point of failure in a distributed database. So, whenever a communication channel or network component is compromised in a distributed system, a user receives one of several replicas of queried data, which otherwise would have been rendered inaccessible by the fault. This replication ensures that distributed databases are persistently available on demand. More importantly, it also ensures that each user perceives that they have exclusive access to data despite being one of many users interacting with replicas. It’s as if they were working on a non-networked machine with single direct access to a database server.
Transparency
Transparency is a technical concept that aptly describes the separation of concerns that distributed databases offer their users. Distributed databases are designed to hide their complex internal mechanisms from their users and present only the data requested. At the same time, users interested in said mechanisms can use a distributed database’s high-level query language, which can express storage and query rules.
Transparency builds on the concept of data independence and makes it so that a user sees a unified view of the data they ask for regardless of its logical and physical representations. Distributed databases are highly compartmentalized. They present to architects the finer details of their replication, fragmentation and integration over vast wide area networks (WANs) and give users centralized views of the data in their queries.
Distributed Database Use Cases
Due to the strengths previously highlighted, distributed database systems are commonly used in data-intensive endeavors such as fraud detection systems, real-time data analytics and facial recognition.
Distributed databases are used as data stores in fraud detection, the most computationally intense forms of which typically involve machine learning to identify, monitor and coordinate responses to incidents of deceptive malfeasance. Such systems utilize the fault tolerance of distributed databases to manage large ledgers of identified threats.
With real-time analytics, you’re primarily concerned with acquiring insights into data the instant it’s collected. Real-time dashboards in which visualizations of data aggregated from several sources appear mere moments after its processing draw a lot of their speed from their proximity to any one of several replicas in a distributed system.
Facial recognition, much like real-time analytics, works best with the speed and presence of data ensured by nearby replicas in a distributed database. Facial recognition is a machine-learning approach to matching faces in pictorial content—an image, video frame snapshot, or similar — against several images in a vast image database for the purpose of identifying individuals.
What Are the Characteristics of Distributed Databases?
Distributed databases present significant value derived from their ability to overcome unique design issues. Aside from not being siloed to a single node, a distributed database system must be able to address the challenges related to replicated data storage and fragmented query execution. It must also be generally secure, easy to integrate into applications and cost-effective.
Distributed Data Storage
Data storage in distributed database systems should address the reliability and consistency demands of distribution and satisfy any arbitrary rules expressed in an appropriate data management language. Reliability can be viewed as a measure of a distributed system’s uptime. In the event of failure, distributed database systems should be able to restore function at failed sites. Consistency refers to the coherence of data transferred between the connected units in a distributed system at a given point in time.
Distributed systems should ensure that their fragments are properly synchronized to allow users to access the same data via the same query, irrespective of their location. The overall consistency of a distributed database system is determined by how synchronized each replica is, which affects the accuracy and up-to-dateness of data available to users.
Distributed database systems need to give their users efficient ways to interact with the data they store and control the conditions for storing that data. Modern data management technology addresses this by collecting, processing, validating, storing, and integrating data from several sources. Relational databases are the standard for data management, and SQL has been the most popular relational database syntax, with several vendor offshoots. While the technology has been criticized for its "one size fits all" approach with little deviation from what is a mostly rigid syntax, SQL remains prominent to this day.
Other modern data management strategies include NoSQL, NewSQL and distributed SQL.
- NoSQL, short for "Not only SQL," consists of data stores and models that are not SQL-compliant. This technology was developed as a response to the rigidity of SQL and emphasizes fault tolerance and uptime, almost always at the expense of consistency.
- NewSQL combines the scalability of NoSQL with the consistency of traditional SQL.
- Distributed SQL database management systems like SingleStoreDBreplicate a single relational database throughout a network, across multiple nodes. They use a SQL API like traditional single-node SQL database management systems but scale and perform significantly better.
Distributed Query Execution
Fragmentation is a central aspect of distributed databases. While fragmentation ensures that data is closest to the user who requests it, the concept makes the necessary integration of granular query and data artifacts (fragments) rather challenging. The first challenge is performance-related because distribution has a heavy cost in terms of disk, CPU, and communications network throughput. The second challenge is the integration of database artifacts to effectively execute a high-level user query. This problem defines the difficulty in representing, at each site, high-level query language as granular data manipulation operations.
A distributed database system should be able to efficiently convert its high-level query syntax to small data manipulation operations relative to its fragments. Distributed query processing (or execution) selects the most optimal integration strategy for high-level queries. Due to the complexity of the process, distributed database systems use a query processor to decompose (and thus fragment) high-level SQL into smaller database operators that are easier to execute and that use the available network I/O and CPU resources more efficiently.
Security
As distributed database systems are commercial software, they’re required to provide secure access predicated on rigorous authentication and authorization. Since security is one of the main functions of the cloud through which distributed databases like SingleStore are delivered, any enterprise-grade distributed database system should secure its data behind the requisite layers of authentication and authorization.
Ease of Integration
Distributed databases are vital to the development and use of applications, especially data-intensive ones. As such, they should be able to be queried by the applications whose data they store. A unifying API such as SQL makes data retrieval, update, storage and deletion — all CRUD operations a breeze for developers who proxy their database queries through object–relational mappers (ORMs).
Cost Optimization
Distributed database systems should seek to give their users, especially database administrators, some control in managing critical database system operations such as server upgrades and data audits. Providers of distributed database management systems should seek to provide insights into optimal queries and other related artifacts. Optimization, in this case, can help users further reduce their costs, especially if said providers offer affordable subscription fees for cloud storage and prevent the need to run an expensive, sophisticated data operation.
Conclusion
Distributed databases are essentially collections of logically related data and database artifacts located at the nodes of a distributed system, also known as sites. Distributed databases overcome the logical and physical data dependence present in early computing models and transcend the scale and processing limitations of more centralized databases operating in the processing confines of a single node. Distributed databases have four major distinctive advantages — scaling, performance, reliability and transparency — that make them ideal for data-intensive real-time fraud detection, analytics, facial recognition and more.
Not every networked database is distributed, however, as the preconditions of distributed data storage and query execution, security, a unified query API and cost optimization must be met for any categorization to be made.
SingleStoreDB meets the criteria specified in the previous section and a real-time, distributed SQL database for multi-cloud, hybrid and on-premise uses — bringing you speed, scale and simplicity.