Mongo Atlas & SingleStore Kai
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
No code change required! 100% MongoDB notebook!
Given the small dataset, the objective of this notebook is not to compare performance.
What you will learn in this notebook:
Install libraries and import modules
Connect to a MongoDB Atlas and SingleStore Kai endpoints
Copy Atlas collections into SingleStore - the synthetic collections describe retail sales transactions with customer information
Compare performance of the same code, from simple to more complex queries
Document counts
Product quantity sold
Average customer satisfaction
Average satisfaction per product
Number of transactions by Location and membership
Top 10 product sales
1. Install libraries and import modules
Make sure that you have created a MongoDB-enabled workspace.
This must be done when creating a workspace (for Standard/Premium Workspaces). For Starter Workspaces, the Kai API is on by default.
In [1]:
!pip install pymongo pandas matplotlib plotly ipywidgets --quiet
In [2]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if shared_tier_check and shared_tier_check[0][1] == 'ON':
    current_database = %sql SELECT DATABASE() as CurrentDatabase
    database_to_use = current_database[0][0]
else:
    database_to_use = "new_transactions"
    %sql CREATE DATABASE {{database_to_use}}
Action Required
Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [3]:
import os
import time

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.subplots as sp
import pymongo
from pymongo import MongoClient
from plotly.offline import plot, iplot, init_notebook_mode
2. Connect to Atlas and SingleStore Kai endpoints
We are using a shared tier on the backend for Atlas.
In [4]:
# No need to edit anything
myclientmongodb = pymongo.MongoClient("mongodb+srv://mongo_sample_reader:SingleStoreRocks27017@cluster1.tfutgo0.mongodb.net/?retryWrites=true&w=majority")
mydbmongodb = myclientmongodb["new_transactions"]
mongoitems = mydbmongodb["items"]
mongocusts = mydbmongodb["custs"]
mongotxs = mydbmongodb["txs"]
Select the workspace that you want to use.
In [5]:
s2clientmongodb = pymongo.MongoClient(connection_url_kai)
s2dbmongodb = s2clientmongodb[database_to_use]
s2mongoitems = s2dbmongodb["items"]
s2mongocusts = s2dbmongodb["custs"]
s2mongotxs = s2dbmongodb["txs"]
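Optionally, you can verify that both endpoints respond before copying any data. A quick sanity check (not part of the original notebook), assuming both clients above connected successfully:

# Optional sanity check: ping both deployments
print(myclientmongodb.admin.command("ping"))   # expect {'ok': 1}
print(s2clientmongodb.admin.command("ping"))   # expect {'ok': 1}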
3. Copy Atlas collections into SingleStore Kai
In [6]:
mongocollections = [mongoitems, mongocusts, mongotxs]
for mongo_collection in mongocollections:
    df = pd.DataFrame(list(mongo_collection.find())).reset_index(drop=True)
    data_dict = df.to_dict(orient='records')
    s2mongo_collection = s2dbmongodb[mongo_collection.name]
    s2mongo_collection.insert_many(data_dict)
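The cell above loads each collection fully into memory, which is fine for this small dataset. For larger collections, a batched copy avoids building one large DataFrame; a minimal sketch under that assumption (the batch_size value is illustrative):

# Hypothetical batched copy for larger collections (not needed for this dataset)
batch_size = 1000
for mongo_collection in mongocollections:
    target = s2dbmongodb[mongo_collection.name]
    batch = []
    for doc in mongo_collection.find():
        batch.append(doc)
        if len(batch) == batch_size:
            target.insert_many(batch)
            batch = []
    if batch:  # flush the remainder
        target.insert_many(batch)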
Count documents in SingleStore
In [7]:
mg_count = s2mongoitems.count_documents({})
mg_count
Compare Queries and Performance
In-app analytics is everywhere. Each of the following sections runs the same query against both endpoints, times ten iterations, and charts the results side by side.
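The timing pattern repeated in each section could be factored into a helper. A minimal sketch (the time_operation name and structure are illustrative, not part of the original notebook):

# Hypothetical helper illustrating the timing pattern used in sections 4-9
def time_operation(operation, num_iterations=10):
    """Run `operation` repeatedly and return per-iteration wall-clock times."""
    times = []
    for _ in range(num_iterations):
        start = time.time()
        result = operation()
        times.append(time.time() - start)
    return times, result  # result comes from the final iteration

# Example usage, same shape as the cells below:
# mongo_times, _ = time_operation(lambda: mongoitems.count_documents({}))
# s2_times, _ = time_operation(lambda: s2mongoitems.count_documents({}))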
4. Document counts
In [8]:
num_iterations = 10
mongo_times = []
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_count = mongoitems.count_documents({})
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_count = s2mongoitems.count_documents({})
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

df = pd.DataFrame.from_dict({
    'iteration': list(range(1, num_iterations + 1)),
    'mongo_times': mongo_times,
    's2_times': s2_times,
})
df_2 = pd.DataFrame.from_dict({
    'counts': [mg_count, s2_count],
    'connection_type': ["mongodb", "singlestore"],
})
figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(df_2, x="connection_type", y="counts", color="connection_type"),
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Document Count Execution Time", "Document Counts"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], row=1, col=1)
fig
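count_documents({}) performs an exact count. If an approximate value is acceptable, pymongo also exposes estimated_document_count(), which reads collection metadata instead of scanning documents:

# Approximate count from collection metadata (may lag behind recent writes)
approx_count = mongoitems.estimated_document_count()
approx_count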
5. Product Quantity Sold
In [9]:
num_iterations = 10
mongo_times = []
pipeline = [
    {"$group": {"_id": "$item.name", "totalQuantity": {"$sum": "$item.quantity"}}},
    {"$sort": {"totalQuantity": -1}},
    {"$limit": 5},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongoitems.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongoitems.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)

item_names = []
item_quantity = []
for i in mg_result:
    item_names.append(i["_id"])
    item_quantity.append(i["totalQuantity"])

figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity),
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Comparison of Product Quantity Sold"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], row=1, col=1)
fig
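Note that aggregate() returns a single-pass cursor; the bar chart above consumes the cursor left over from the last timing iteration. To inspect or reuse the results separately, materialize them first:

# Materialize the aggregation result; the cursor can only be iterated once
top_sellers = list(mongoitems.aggregate(pipeline))
for doc in top_sellers:
    print(doc["_id"], doc["totalQuantity"])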
6. Average Customer Satisfaction
In [10]:
num_iterations = 10
mongo_times = []
pipeline = [
    {'$group': {
        '_id': '$customer.email',
        'average_satisfaction': {'$avg': '$customer.satisfaction'},
    }},
    {'$limit': 10},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongotxs.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongotxs.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)

item_names = []
item_quantity = []
for i in mg_result:
    item_names.append(i["_id"])
    item_quantity.append(i["average_satisfaction"])

figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity),
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Average Customer Satisfaction"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], row=1, col=1)
fig
7. Average Satisfaction per Product
In [11]:
num_iterations = 10
mongo_times = []
pipeline = [
    {"$lookup": {
        "from": "txs",
        "localField": "tx_id",
        "foreignField": "transaction_id",
        "as": "transaction_links",
    }},
    {"$limit": 10},
    {"$unwind": "$transaction_links"},
    {"$group": {
        "_id": {"item": "$item.name"},
        "Average Satisfaction": {"$avg": "$transaction_links.customer.satisfaction"},
    }},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongoitems.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongoitems.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)

item_names = []
item_quantity = []
for i in mg_result:
    item_names.append(i["_id"]['item'])
    item_quantity.append(i["Average Satisfaction"])

figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity),
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Average Satisfaction per Product"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], row=1, col=1)
fig
8. Number of transactions by location and membership
In [12]:
num_iterations = 10
mongo_times = []
pipeline = [
    {"$lookup": {
        "from": "custs",
        "localField": "customer.email",
        "foreignField": "email",
        "as": "transaction_links",
    }},
    {"$limit": 100},
    {"$group": {
        "_id": {
            "location": "$store_location",
            "membership": "$transaction_links.membership",
        },
        "count": {"$sum": 1},
    }},
    {"$sort": {"count": -1}},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongotxs.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongotxs.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = list(range(1, num_iterations + 1))
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)

item_names = []
item_quantity = []
for i in mg_result:
    toadd = i["_id"]['location'] + ', ' + i["_id"]['membership'][0]
    item_names.append(toadd)
    item_quantity.append(i['count'])

figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),
    px.bar(x=item_names, y=item_quantity),
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Sales per Store"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], row=1, col=1)
fig
9. Top 10 Product Sales
In [13]:
num_iterations = 10
mongo_times = []
pipeline = [
    {"$project": {
        "item": 1,
        "revenue": {"$multiply": ["$item.price", "$item.quantity"]},
    }},
    {"$group": {
        "_id": "$item.name",
        "total_revenue": {"$sum": "$revenue"},
    }},
    {"$sort": {"total_revenue": -1}},
    {"$limit": 10},
]
for i in range(num_iterations):
    mg_start_time = time.time()
    mg_result = mongoitems.aggregate(pipeline)
    mg_stop_time = time.time()
    mongo_times.append(mg_stop_time - mg_start_time)

s2_times = []
for i in range(num_iterations):
    s2_start_time = time.time()
    s2_result = s2mongoitems.aggregate(pipeline)
    s2_stop_time = time.time()
    s2_times.append(s2_stop_time - s2_start_time)

x_axis = [i + 1 for i in range(num_iterations)]
data = {
    'iteration': x_axis,
    'mongo_times': mongo_times,
    's2_times': s2_times,
}
df = pd.DataFrame.from_dict(data)

item_names = []
item_quantity = []
for i, result in enumerate(mg_result):
    if i >= 1:
        toadd = result["_id"]
        item_names.append(toadd)
        item_quantity.append(result['total_revenue'])

figures = [
    px.line(df.iloc[1:], x='iteration', y=['mongo_times', 's2_times']),  # Exclude the first iteration from the line chart
    px.bar(x=item_names, y=item_quantity),
]
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=["Execution Time", "Top 10 Product Sales"])
for i, figure in enumerate(figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=1, col=i + 1)
fig.update_yaxes(title_text="Time in Seconds", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_xaxes(tickmode='array', tickvals=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], row=1, col=1)
fig
Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [14]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS new_transactions;
Details
About this Template
Run your MongoDB queries on SingleStore Kai with no code change.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.