in Engineering


Improving Aggregation Performance on MongoDB®

Sarung Tripathi

VP, Worldwide Field Engineering

There’s a reason most apps (web, mobile, etc.) start off on MongoDB. It’s incredibly easy to get started, no schema required and best of all —  it’s free!

Improving Aggregation Performance on MongoDB®

However as apps evolve, there comes a point where developers find the need to run aggregations on that data. Think of your favorite workout app, for example: millions of telemetry records on speed, distance and your body are nothing without aggregations including: 

  • $count. How many activities of a certain type (running, biking, etc.) has a user done?
  • $lookup. How has my body reacted to certain types of activities?
  • $group: What is my total distance for a given activity type this week?

At SingleStore, we offer a native API for MongoDB, SingleStore Kai™,  to enable seamless and performant execution of aggregation queries without any application code changes and zero ETL data movement. In the rest of this blog, we outline how SingleStore is able to do this and how you can get started today.

Before we go further, let’s address the elephant in the room. You must be thinking, “what is a relational database company doing lecturing me on unstructured data?” SingleStore is a multi-model database with strong support for aggregations on JSON of any kind. Here’s a quick primer on a few of our JSON capabilities before we keep going:

seekable-jsonSeekable JSON

SingleStore is highly performant when seeking large JSON objects. We have built this using a few core engineering efforts:

  • Top-level JSON properties are stored column-wise, and use the same encodings as other data types (i.e., numbers, strings, and dates)
  • Nested properties (inside sub-documents) are stored in sequential column format, using SingleStore's standard encoding
  • Arrays are pivoted and stored in column format
  • Columns that do not regularly appear in the documents (e.g., less than 1 in 10,000) are stored in a catch-all area

sparse-jsonSparse JSON

SingleStore has implemented optimizations to reduce memory usage and processing time of sparse JSON objects. These optimizations include:

  • An algorithm inspired by run-length encoding
  •  A cascading-trigger-based approach for nested JSON structures

The sparse JSON compression technique in SingleStore addresses the issue of excessive NULL values in the internal columnar representation of JSON data. This translates directly into better query performance on complex, sparse JSON.

count$count

One of the most common aggregation functions used in MongoDB is $count. Many MongoDB users struggle to get good performance out of $count operations as they often involve several other steps to get the data in MongoDB to an aggregation-worthy state. For example, if we were running a query to “count all activities of a certain type” in MongoDB, this would require an $unwind and $match —  then a $count function.

db.users.aggregate([
{ $unwind: "$workouts" },
{ $match: { "workouts.type": "run" } },
{ $count: "totalRunWorkouts" }
])

Remember, we can use this exact same syntax used in MongoDB within SingleStore. Under the hood, SingleStore will first unnest the JSON arrays efficiently by handling large batches of rows in parallel. The $match operation is effectively a WHERE clause in SingleStore. When a query with a WHERE clause is executed, SingleStore uses segment elimination to skip all segments where there is no data matching the query. This makes $match highly performant.

Finally, the COUNT itself benefits from vectorized processing. Instead of processing data row-by-row, SingleStore processes large batches of data from a column in one or more relatively simple loops.

Users that run $count, $match and $unwind queries have seen 142x improvements over MongoDB using SingleStore Kai.

lookup$lookup

The $lookup aggregation in MongoDB performs a join to another collection in the same database to filter in documents from the “joined” collection for processing. We would use this, for example, if we wanted to combine workout data with user profile information to display user-specific workout statistics. This query would also use $match, as previously discussed.

db.workouts.aggregate([
{ $match: { userId: "user123" } },
{
$lookup: {
from: "users",
localField: "userId",
foreignField: "userId",
as: "userDetails"
}
}
])

The $lookup in MongoDB will benefit heavily from SingleStore’s fast JOIN capabilities. SingleStore is a distributed SQL database which holds data in partitions based on a simple SHARD key (which can be compound and/or leverage Projections). Sharding by userId in the preceding example would make sure all records with the same UserId live on the same partition. This will make the join operations extremely fast in SingleStore, delivering analytical performance that MongoDB simply can’t.

Users that run $match and $lookup queries have seen 82x improvements over MongoDB using SingleStore Kai.

group$group

Finally, the $group aggregation in Mongo is used by developers to group documents by an identifier and apply aggregations to each group. In SQL, this translates to a GROUP BY. Let’s stick with the workout activities example. Here, the $group function could be used to compute the average speed for each type of activity (e.g., run, bike ride) for a user.

db.workouts.aggregate([
{ $match: { userId: "user123" } },
{ $group: { _id: "$type", averageSpeed: { $avg: "$speed" } } }
])

Once again, we also use $match and leverage the benefits of SHARD and SORT keys in SingleStore. GROUP BY performance will also be drastically improved over MongoDB with these selections. For example, selecting SHARD keys that match common GROUP BY columns will enable local GROUP BY operations. Frequently grouped columns in the SORT KEY also enable StreamingGroupBy operations in SingleStore’s engine, which further improve performance.

Users that run $group queries have seen 299x improvements over MongoDB using SingleStore Kai.

Getting started

Aggregations can be frustrating in MongoDB, but fortunately there’s an alternative with SingleStore Kai. Simply leverage native Mongo CDC to get your data into SingleStore and without changing anything in your app but the connection string, harness the power of SingleStore today.


Share