For Developers, By Developers
Sometimes, if you’re really lucky, you get unsolicited raves from developers who just love what your technology does for them.
It’s amazing to be able to showcase opinions that developers write to share with other developers. We recently had this experience with Jack Ellis from Fathom, and this week, we received this contribution from Ero Carrera, Software Developer and Ex-Googler. We’re grateful to Ero for this guest blog post, where he shares his experience working with modern databases, how he discovered SingleStore, and how it’s been such a good fit for his engineering work.
In Ero’s words:
First, some background:
I have been working with SQL on-and-off for over 20 years. Over that time, I spent uncountable hours with Oracle, MySQL, PostgreSQL, and SQLite, optimizing queries, backends, and designing schemas. While I learned a lot fiddling with different databases, a lot of effort went into setting them up, keeping them up to date, backing them up, and optimizing their settings to get the most out of each solution.
These last ten years I worked at Google, where I was lucky to experience one of the most incredible technology stacks in the industry. As an engineer, the internal database technologies allow you to nearly forget about the backend and just use it, at ridiculously large scales, without having to care about storage or even optimizing queries much. Being able to just focus on building systems is something really easy to get addicted to.
Some of the projects I worked on involved intelligence analysis for cybersecurity. Those required finding and analyzing relationships between different datasets of indicators of malicious activity. Each of the datasets was accessible via an interface using the “Google flavored” SQL, known outside Google as BigQuery. To read the datasets, one could simply write queries to produce and massage the data into the desired format and subsequently process it, with no need to worry about the SQL engine, networking, or storage (for the most part). Each of those datasets was in the tens to hundreds of terabytes (many billions of rows). Working at that scale while being able to forget about the stack of technologies (still being conscious of performance and following best practices, of course) was simply incredible.
Database Discovery
Upon leaving Google last year, I was worried about the state of affairs in the “real world”, and whether I might have gotten too used to those fancy toys. I started taking a look at what’s available. I was hoping to find something of industrial strength that made my life as easy as possible, but still had all the bells and whistles of modern databases. My needs were leaning towards the relational side. While I love key-value databases, I needed to be able to run analytics on structured data, so I wanted something “SQL native”.
My datasets were nowhere close to those I had at Google but I still wanted speed and the fast query times that enable “interactive research”. For my current projects I have time series data, where I need to join multiple tables, with complex nested queries, to extract aggregate statistics. The tables are in the tens of millions of rows, which are joined and grouped-by in several rounds. In the pipeline I’ve built so far I still haven't found a case where SingleStore doesn’t return results in near-real time.
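To make that query shape concrete, here is a minimal sketch of a join followed by two rounds of grouping, written with chained WITH clauses. The `sensors` and `readings` tables and their contents are hypothetical, and Python's built-in `sqlite3` module is used only as a stand-in so the example runs without a database server. The SQL itself is plain standard SQL, not anything specific to SingleStore.

```python
import sqlite3

# In-memory stand-in database with two tiny, made-up time-series tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensors (sensor_id INTEGER PRIMARY KEY, site TEXT);
CREATE TABLE readings (sensor_id INTEGER, day TEXT, value REAL);
INSERT INTO sensors VALUES (1, 'north'), (2, 'north'), (3, 'south');
INSERT INTO readings VALUES
  (1, '2021-01-01', 10.0), (1, '2021-01-02', 14.0),
  (2, '2021-01-01', 20.0), (2, '2021-01-02', 18.0),
  (3, '2021-01-01', 5.0),  (3, '2021-01-02', 7.0);
""")

QUERY = """
WITH daily AS (
    -- Round 1: join readings to sensors and total them per site and day.
    SELECT s.site, r.day, SUM(r.value) AS total
    FROM readings AS r
    JOIN sensors AS s ON s.sensor_id = r.sensor_id
    GROUP BY s.site, r.day
),
site_stats AS (
    -- Round 2: aggregate the daily totals into per-site statistics.
    SELECT site, AVG(total) AS avg_daily_total, MAX(total) AS peak_daily_total
    FROM daily
    GROUP BY site
)
SELECT site, avg_daily_total, peak_daily_total
FROM site_stats
ORDER BY site
"""

rows = conn.execute(QUERY).fetchall()
for site, avg_total, peak_total in rows:
    print(site, avg_total, peak_total)
```

Each WITH clause names one intermediate result, which keeps a multi-stage aggregation readable compared with deeply nested subqueries.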
A good friend who was starting his second startup recommended ClickHouse for time-series data. I tried to play with it, but bumped into some friction setting up a test environment and then having to interface it with their own libraries. I needed something easy to maintain, with a very comfortable interface, and eventually found a great analysis, Selecting a Database for an Algorithmic Trading System, that convinced me to try SingleStore (still going by MemSQL in the article).
I was drawn to it by how easy SingleStore is to interface with: just use the standard MySQL libraries and interfaces! A Docker image was readily available, it could easily bulk-ingest CSV data, and it had all the bells and whistles of distributed databases. The in-memory rowstore and on-disk columnstore reminded me of the similar optimizations of Google’s internal tools, which made them so incredibly fast. Some of the technologies used at Google are discussed in the white paper An Inside Look at Google BigQuery and some posts on Stack Overflow, where a brief explanation of ColumnIO, Google’s columnar format, is given.
Additionally, the possibility of spinning up a managed instance in any of the main cloud providers seemed interesting, but was not on my radar originally (more on that in a bit).
My SingleStore Experience:
- My initial experience with SingleStore’s Docker image was very smooth. I was able to get it up and running within minutes. Documentation was up-to-date and I had no trouble setting it all up.
- My Python code could just use MySQL Connector/Python to connect to the instance, and it all worked. The workflow calls for scheduled ingestions of a few thousand to a few million records from CSV dumps. LOAD DATA worked like a charm to read them, only taking a few seconds for the largest dataset.
- My previous pipelines exported data into HDF5 that was later analyzed in Mathematica. I found that setup terribly slow and cumbersome compared with SingleStore plus a BI tool like Metabase.
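As a rough illustration of the CSV ingestion step mentioned above, here is a minimal sketch using MySQL Connector/Python and LOAD DATA. It is not the actual pipeline code: the function names, the CSV layout (comma-separated with a header row), the table name, and the connection parameters are all assumptions made for the example.

```python
import re


def build_load_statement(csv_path: str, table: str) -> str:
    """Build a LOAD DATA statement for a comma-separated file with a header row."""
    # Identifiers cannot be passed as query parameters, so validate the name.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    # Escape backslashes and single quotes so the path is a valid SQL string.
    escaped_path = csv_path.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"LOAD DATA LOCAL INFILE '{escaped_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES"
    )


def ingest_csv(conn_params: dict, csv_path: str, table: str) -> int:
    """Run the bulk load and return the row count reported by the cursor."""
    # Imported lazily so the helper above works without the driver installed
    # (pip install mysql-connector-python).
    import mysql.connector

    # allow_local_infile lets the client send a local file to the server.
    conn = mysql.connector.connect(allow_local_infile=True, **conn_params)
    try:
        cursor = conn.cursor()
        cursor.execute(build_load_statement(csv_path, table))
        conn.commit()
        return cursor.rowcount
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical path and table, shown only to illustrate the statement.
    print(build_load_statement("/data/ticks.csv", "ticks"))
```

A scheduled job could call `ingest_csv` with a dictionary of `host`, `port`, `user`, `password`, and `database` values for the target instance.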
More recently I’ve been tempted to try SingleStore Helios to not even have to bother with the occasional updates (which were pretty easy anyway) or starting/stopping the Docker image. It could not have been easier. Again I found the documentation clear, and I was able to launch my own cluster with ease. I only needed to update the config files to point my pipelines to the new instance (I chose to host it in AWS) and it just worked.
Additionally, given that I had to make zero code changes, I can easily go back and forth between the Docker instance and the managed cluster, just by updating a configuration file.
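One way such a configuration-only switch could look is sketched below with Python's standard `configparser`. The file layout, section names, host names, and the `connection_params` helper are all hypothetical, not the actual configuration used here; the password is deliberately left out of the file.

```python
import configparser

# Hypothetical config: one section per target, plus a pointer to the active one.
EXAMPLE_CONFIG = """
[database]
active = docker

[docker]
host = 127.0.0.1
port = 3306
user = root
database = analytics

[helios]
host = example-cluster.invalid
port = 3306
user = admin
database = analytics
"""


def connection_params(config_text: str) -> dict:
    """Return connection settings for whichever profile is marked active."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    profile = parser["database"]["active"]
    section = parser[profile]
    return {
        "host": section["host"],
        "port": section.getint("port"),
        "user": section["user"],
        "database": section["database"],
    }


if __name__ == "__main__":
    print(connection_params(EXAMPLE_CONFIG))
```

Changing `active = docker` to `active = helios` is the only edit needed to retarget every pipeline that reads its settings through this helper, since both targets are reached through the same MySQL-compatible client.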
I had one wish related to the managed service: being able to suspend it when I’m not using it. I am currently working on a hobby project and do not need other users accessing the database, so keeping the instance spinning is a bit wasteful. I reached out to the very responsive SingleStore team and they were happy to let me know that there’s an upcoming on/off feature. So I’ll be able to rely on the managed instance exactly as much as I need to. How great is that?
I have also been playing with UDFs to compute some metrics for my analytics and again, it works. I could simply write function definitions, for scalar functions, in SQL and call them from my queries, leading to much simpler SQL.
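For readers unfamiliar with SQL-defined scalar functions, the following is a minimal sketch of the idea. The `mid_price` function, the `quotes` table, and its columns are made up for illustration, and the `cursor` argument is assumed to be any DB-API cursor (for example from MySQL Connector/Python) connected to a SingleStore database.

```python
# Hypothetical scalar UDF written in SQL: the midpoint of a bid and an ask.
CREATE_MID_PRICE = """
CREATE OR REPLACE FUNCTION mid_price(bid DOUBLE, ask DOUBLE) RETURNS DOUBLE AS
BEGIN
  RETURN (bid + ask) / 2;
END
"""

# Once defined, the function is called like any built-in scalar function.
MID_PRICE_QUERY = "SELECT ts, mid_price(bid, ask) AS mid FROM quotes ORDER BY ts"


def install_and_query(cursor):
    """Create the UDF, run a query that calls it, and return the rows."""
    # The whole definition is sent as a single statement, so no client-side
    # DELIMITER handling is needed here.
    cursor.execute(CREATE_MID_PRICE)
    cursor.execute(MID_PRICE_QUERY)
    return cursor.fetchall()
```

Keeping the arithmetic inside a named function means each query states what it computes rather than repeating the expression.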
Conclusion
I have yet to find any place where SingleStore doesn't meet my needs. Granted, I have not yet pushed it very hard, but it has made me much more productive, allowing me to focus on the logic of my pipelines (Prefect, ingesting & preparing data, and training & deploying some ML models) and building analytics dashboards, for which I am using Metabase (which also works like a charm with SingleStore).
While there’s extensive documentation on how to optimize for high-performance settings under more demanding conditions, I have not yet managed to write any queries that take more than a few seconds at most. Multiple WITH clauses and nested queries are handled gracefully.
I am very happy to have come across SingleStore. It has made me much more productive and provided me with a “database experience” like the one I had at Google. I look forward to my projects growing and to seeing how far it can be pushed!
Ero Carrera
Software Engineer, Ex-Googler
Experience SingleStore for yourself! Install SingleStoreDB Self-Managed for FREE or deploy SingleStore Helios with $600 in FREE credits.