Integrating Spark with SingleStoreDB enables Spark to leverage the high-performance, real-time data processing capabilities of SingleStoreDB — making it well-suited for analytical use cases that require fast, accurate insights from large volumes of data.
The Hadoop ecosystem has been in existence for well over a decade. It features various tools and technologies including HDFS (Hadoop Distributed File System), MapReduce, Hive, Pig, Spark and many more. These tools are designed to work together seamlessly and provide a comprehensive solution for big data processing and analysis.
However, existing Hadoop environments come with some major issues. One is the complexity of the Hadoop ecosystem, which makes it challenging for users to set up and manage. Another is the high cost of maintaining and scaling Hadoop clusters, which can be a significant barrier to adoption for smaller organizations.
In addition, Hadoop has faced challenges in keeping up with the rapid pace of technological change and evolving user requirements — leading to some criticism of the platform's ability to remain relevant in the face of newer technologies.
The good news? Apache Spark can be used with a modern database like SingleStoreDB to overcome these challenges.
Apache Spark
Apache Spark is a popular tool for analytical use cases due to its ability to handle large-scale data processing with ease. It offers a variety of libraries and tools for data analysis, including Spark SQL, which allows users to run SQL queries on large datasets, as well as MLlib, a library for machine learning algorithms.
Spark's distributed nature makes it highly scalable, allowing it to process large volumes of data quickly and efficiently. Additionally, Spark Streaming enables real-time processing of data streams, making it well-suited for applications in areas like fraud detection, real-time analytics and monitoring.
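To make the Spark SQL piece concrete, here is a minimal sketch of querying a DataFrame with SQL. The file path and column names are illustrative and are not part of the dataset used later in this post:
// Minimal Spark SQL sketch: load a CSV into a DataFrame, register it as a temporary view,
// then query it with plain SQL. The path and columns (name, age) are illustrative.
val people = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("hdfs://localhost:9000/test_data/people.csv")
people.createOrReplaceTempView("people")
spark.sql("SELECT name, AVG(age) AS avg_age FROM people GROUP BY name").show(false)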
Overall, Apache Spark's flexibility and powerful tools make it an excellent choice for analytical use cases, and it has been widely adopted in various industries including finance, healthcare, retail and more.
SingleStoreDB
SingleStoreDB is a real-time, distributed SQL database that stores and processes large volumes of data. It is capable of performing both OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) workloads on a unified engine, making it a versatile tool for a wide range of use cases.
Overall, SingleStoreDB's high-performance, distributed architecture — combined with its advanced analytical capabilities — makes it an excellent choice for analytical use cases including real-time analytics, business intelligence and data warehousing. It has been widely adopted by companies across finance, healthcare, retail, transportation, eCommerce, gaming and more. And, SingleStoreDB can be integrated with Apache Spark to enhance its analytical capabilities.
Using Apache Spark with SingleStoreDB
SingleStoreDB and Spark can be used together to accelerate analytics workloads by taking advantage of the computational power of Spark, together with the fast ingest and persistent storage of SingleStoreDB. The SingleStore-Spark Connector allows you to connect your Spark and SingleStoreDB environments. The connector supports both data loading and extraction from database tables and Spark DataFrames.
The connector is implemented as a native Spark SQL plugin and supports Spark's DataSource API. Spark SQL can operate on a variety of data sources through the DataFrame interface, and the DataFrame API is the standard way Spark interacts with external systems.
In addition, the connector is a true Spark data source; it integrates with the Catalyst query optimizer, supports robust SQL pushdown and leverages SingleStoreDB LOAD DATA to accelerate ingest from Spark via compression.
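As a sketch of how this looks in practice, connection options can also be set once on the Spark session using the spark.datasource.singlestore.* configuration prefix described in the connector documentation, rather than repeating them on every read and write. The endpoint, credentials and database below are placeholders:
// Set connector options once per Spark session; values are placeholders for your environment.
spark.conf.set("spark.datasource.singlestore.ddlEndpoint", "127.0.0.1:3306")
spark.conf.set("spark.datasource.singlestore.user", "root")
spark.conf.set("spark.datasource.singlestore.password", "Singlestore@123")
spark.conf.set("spark.datasource.singlestore.database", "test")
spark.conf.set("spark.datasource.singlestore.loadDataCompression", "LZ4")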
Spark and SingleStoreDB can work together to accelerate parallel read and write operations. Spark can be used to perform data processing and analysis on large volumes of data, writing the results back to SingleStoreDB in parallel. This can be done using Spark's distributed computing capabilities, which allow it to divide data processing tasks into smaller chunks that can be processed in parallel across multiple nodes. By distributing the workload in this way, Spark can significantly reduce the time it takes to process large volumes of data and write the results back to SingleStoreDB.
Overall, by combining Spark's distributed computing capabilities with SingleStore's distributed architecture, it is possible to accelerate parallel read and write operations on large volumes of data, enabling real-time processing and analysis. The parallel read operation creates multiple Spark tasks, which can drastically improve performance.
The SingleStore-Spark connector also provides parallel read repartitioning features to ensure that each task reads approximately the same amount of data. For queries with top-level LIMIT clauses, this option distributes the read across multiple partitions so that all rows do not end up in a single partition. A sketch of these options follows.
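As a rough sketch, this parallel read behaviour is controlled through connector options such as enableParallelRead and parallelRead.repartition. The option names are taken from the connector documentation; the endpoint, credentials and table name below are illustrative and match the examples later in this post:
// Illustrative parallel read configuration; option names per the connector documentation.
val df_parallel = spark.read.format("singlestore")
  .option("ddlEndpoint", "127.0.0.1:3306")
  .option("user", "root")
  .option("password", "Singlestore@123")
  .option("database", "test")
  .option("enableParallelRead", "automatic")   // use parallel read when the query supports it
  .option("parallelRead.repartition", "true")  // even out the rows read by each Spark task
  .load("foo1")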
Spark-SingleStoreDB Integration Architecture
Business Benefits
- Horizontally and vertically scalable
- Multi-tenant
- Integrates seamlessly with our systems
- 100% ANSI SQL
- Very easy to access (no special training required)
- Optimized Spark integration via a connector
- No need to rewrite existing Spark code (*)
- Intelligent distribution of storage and compute to achieve the best possible performance
Spark-SingleStoreDB Integration in Action
- Launch the Spark shell with the SingleStore-Spark connector, and its dependency
spark-shell --packages com.singlestore:singlestore-spark-connector_2.12:4.0.0-spark-3.2.0
- Load data from HDFS into the dataframe
val df_csv = spark.read.format("csv")
  .option("header", "true")
  .load("hdfs://localhost:9000/test_data/MOCK_DATA.csv")
- Display data in the Spark shell
df_csv.show(false)
- Create the dataframe with transformation
import org.apache.spark.sql.functions.{col, concat}

val df = df_csv.withColumn("fullname", concat(col("first_name"), col("last_name")))
df.show(false)
- Load the dataframe from Spark to the SingleStoreDB table
df.write.format("singlestore")
  .option("ddlEndpoint", "127.0.0.1:3306")
  .option("user", "root")
  .option("password", "Singlestore@123")
  .option("database", "test")
  .option("loadDataCompression", "LZ4")
  .option("truncate", "false")
  .mode("overwrite").save("foo1")
- Read the SingleStoreDB table into the Spark dataframe
val df_s2 = spark.read.format("singlestore")
  .option("ddlEndpoint", "127.0.0.1:3306")
  .option("user", "root")
  .option("password", "Singlestore@123")
  .option("database", "test")
  .load("foo1")
df_s2.show(false)
- Read the SingleStoreDB table into a Spark dataframe, with a filter condition
val df_s2_filter = spark.read.format("singlestore")
  .option("ddlEndpoint", "127.0.0.1:3306")
  .option("user", "root")
  .option("password", "Singlestore@123")
  .option("database", "test")
  .load("foo1")
  .filter(col("id") > 200)
df_s2_filter.show(false)
- Query the table with join
val df_join = spark.read.format("singlestore")
  .option("ddlEndpoint", "127.0.0.1:3306")
  .option("user", "root")
  .option("password", "Singlestore@123")
  .option("database", "test")
  .option("query", "select e.empno, e.ename, e.job, e.sal, d.deptno, d.dname from emp e INNER JOIN dept d ON e.deptno = d.deptno")
  .load()
df_join.show()
- Query the table with pushdown disabled
val df_s2_nopushdown = spark.read.format("singlestore")
  .option("ddlEndpoint", "127.0.0.1:3306")
  .option("user", "root")
  .option("password", "Singlestore@123")
  .option("database", "test")
  .option("disablePushdown", "true")
  .load("foo1").filter(col("id") > 200)
df_s2_nopushdown.show(false)
In the previous examples, we used Scala code to read data from and write data to SingleStoreDB.
The following outlines the steps for a Java implementation:
- Build a jar using java code and pom.xml
cd /home/ubuntu/maventest/SparkSingleStoreConnection/
mvn clean package
In our setup, we have built the jar using main.java and pom.xml. The path of main.java is: /home/ubuntu/maventest/SparkSingleStoreConnection/src/main/java/example
The path of pom.xml is: /home/ubuntu/maventest/SparkSingleStoreConnection
- Execute the jar file using the following command (note that the spark-submit options must appear before the application jar). The path of the sample jar is: /home/ubuntu/maventest/SparkSingleStoreConnection/target
spark-submit --class main --master yarn --deploy-mode cluster \
  --driver-memory 2G --executor-memory 2G \
  --executor-cores 2 --num-executors 2 \
  SparkSingleStoreConnection-1.0-SNAPSHOT.jar
Summary
Overall, the Spark-SingleStoreDB integration enables Spark to leverage the high-performance, real-time data processing capabilities of SingleStoreDB, supporting analytical use cases that require fast, accurate insights from large volumes of data.
If you are interested in reading further, or want to test out the integration for yourself, check out these SingleStore blog posts and GitHub repos: