At Strata+Hadoop World, SingleStore Software Engineer John Bowler shared two ways of building production data pipelines in SingleStore:
**1) Using Spark for general purpose computation**
**2) Through a transform defined in a SingleStore pipeline for general purpose computation**
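To make the second approach concrete, here is a minimal sketch of what a transform script might look like. It assumes a transform is an executable that receives extracted records on standard input and writes rows to standard output, with one comma-separated record per line (an id followed by integer features). That record layout, the function names, and `placeholder_predict` are hypothetical and are not taken from the talk; in a real pipeline the placeholder would be replaced by a call into a trained model.

```python
def placeholder_predict(features):
    """Hypothetical stand-in for a trained model's prediction call.

    It only exists so the sketch runs; it is NOT a real classifier.
    """
    return sum(features) % 10


def transform(lines, out, predict=placeholder_predict):
    """Read 'id,f1,f2,...' records and write 'id<TAB>prediction' rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record_id, *raw_features = line.split(",")
        features = [int(value) for value in raw_features]
        out.write(f"{record_id}\t{predict(features)}\n")


# As a standalone executable, the entry point would be (assumption):
#   import sys
#   transform(sys.stdin, sys.stdout)
```

Keeping the parsing and the model call in separate functions means the same script can be exercised locally with an in-memory stream before it is attached to a pipeline.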
In the video below, John runs a live demonstration of SingleStore and Apache Spark for entity resolution and fraud detection across a dataset of 100,000 employees and 50 million customers. Working with SingleStore, John writes a Spark job that uses an open source entity resolution library called Duke to sort through and score combinations of customer and employee data.
SingleStore makes this possible by reducing network overhead through the SingleStore Spark Connector along with native geospatial capabilities. John finds the top 10 million flagged customer and employee pairs across 5 trillion possible combinations in only three minutes. Finally, John uses SingleStore Pipelines and TensorFlow to write a machine learning Python script that accurately identifies thousands of handwritten numbers after training the model in seconds.
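The pairwise scoring idea behind the entity resolution demo can be sketched in a few lines of plain Python. This is an illustration only: it uses the standard library's `difflib` string similarity rather than Duke, runs on a single machine rather than on Spark, and the records, field names, and equal field weights are all hypothetical. It also checks the arithmetic behind the number of possible pairs quoted above.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy records; the schema used in the demo is not given here.
employees = [
    {"id": "e1", "name": "Jane Doe", "address": "12 Main St"},
    {"id": "e2", "name": "Raj Patel", "address": "99 Oak Ave"},
]
customers = [
    {"id": "c1", "name": "Jane  Doe", "address": "12 Main Street"},
    {"id": "c2", "name": "Maria Lopez", "address": "7 Pine Rd"},
]


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(employee, customer):
    """Average the name and address similarity (equal weights assumed)."""
    return 0.5 * similarity(employee["name"], customer["name"]) + \
        0.5 * similarity(employee["address"], customer["address"])


def top_pairs(employees, customers, k):
    """Score every employee/customer combination and keep the k best."""
    scored = [
        (score_pair(e, c), e["id"], c["id"])
        for e, c in product(employees, customers)
    ]
    return sorted(scored, reverse=True)[:k]


# 100,000 employees x 50 million customers = 5 trillion candidate pairs.
TOTAL_PAIRS = 100_000 * 50_000_000
```

Calling `top_pairs(employees, customers, 1)` on the toy data returns the `e1`/`c1` pair, since both its name and address fields nearly match. At the scale described in the talk, scoring every pair this way on one machine would be impractical, which is why the demo distributes the work with Spark.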
Get the SingleStore Spark Connector Guide
The 79-page guide covers how to design, build, and deploy Spark applications using the SingleStore Spark Connector. Inside, you will find code samples to help you get started and performance recommendations for your production-ready Apache Spark and SingleStore implementations.
Watch the Video Recording:
About the Speaker
John Bowler is a Software Engineer at SingleStore. John has a background in machine learning, algorithms, and distributed data warehouses. John is a graduate of MIT who previously interned at SpaceX, where he helped write control algorithms for the SuperDraco rocket engine.