Experience teaches us that when loading data into a database, in whatever form ― normalized, denormalized, schema-less, hierarchical, key-value, document, etc ― the devil is always in the data load.
For enterprise companies in the real-time economy, every second saved means improved efficiency, productivity, and profitability. Thankfully, SingleStore makes your enterprise data fast to load and easy to access. You can spin up a SingleStore cluster in minutes and load data very quickly using SingleStore Pipelines. SingleStore scales out for fast parallel processing of very large data sets. SingleStore Pipelines deliver exactly-once semantics for duplicate-free, fast data ingest.
In this blog tutorial, we’ll illustrate rapid data ingest with SingleStore using real-world data from the New York City Taxi and Limousine Commission (NYC TLC) yellow taxi trip data set, which is over 200GB of CSV files. We’ll break the tutorial into several sections:
- Pre-configuration requirements for everything you need to complete the tutorial, including AWS S3 and SingleStore
- Downloading, compressing, and uploading the yellow taxi data to AWS S3
- Loading the data with SingleStore Pipelines in about 24 minutes
Pre-configuration Requirements
To complete this tutorial, you can use a Mac OS, Linux, or Windows machine, and you’ll need a:
- Bash compatible terminal
- AWS account and S3 bucket with read and write privileges
- AWS Command Line Interface (CLI)
- MySQL compatible client to execute SQL statements
- Running SingleStore cluster
AWS S3
First, log into your AWS account. To get started with AWS S3, review Getting Started with Amazon Simple Storage Service.
We will store approximately 44GB of compressed data versus 200GB of raw data, which typically costs about $4.50 a month. After loading the data into SingleStore, you can delete the S3 data and terminate the AWS account to avoid additional charges.
AWS Command Line Interface (CLI)
The AWS CLI lets you script S3 operations from the shell, such as creating a folder, listing folder contents, and uploading files. Follow the User Guide for installing the AWS CLI.
AWS Credentials
To upload data to S3 using the AWS CLI and to extract files from S3 with SingleStore Pipelines, you’ll need AWS credentials for an IAM user. You can generate an AWS Access Key ID and AWS Secret Access Key for that user by following this guide.
S3 Bucket
Log in to the AWS Console and create a new S3 bucket in the region for your account. The NYC taxi data set includes yellow taxi, green vehicle, and for-hire vehicle (FHV) data. We’ll only use the yellow taxi data in our initial data load, as it is by far the largest data set. Because the yellow taxi schema changes over time, we’ll create yellow taxi subfolders to demarcate the schema changes. In your S3 bucket, create the following folder hierarchy:
|- nyc-taxi
|--- yellow_tripdata
|----- 200901-201412
|----- 201501-201606
|----- 201607-201612
MySQL Compatible Client
SingleStore uses the open source MySQL protocol, so in order to execute Data Definition Language (DDL) and Data Manipulation Language (DML) statements you’ll need a MySQL compatible client. Here are some popular, free clients:
- MySQL Workbench
- Sequel Pro
- MySQL client for Linux (APT repository, Debian packages): sudo apt-get update && sudo apt-get install mysql-client
- MySQL client for Linux (RPM; Red Hat Enterprise Linux/Oracle Linux/CentOS 5 systems): sudo yum install mysql-community-client mysql-5.*
SingleStore Cluster
A SingleStore cluster consists of at least one master aggregator node and one leaf node.
Setting up SingleStore
SingleStore runs natively on 64-bit Linux operating systems. Download and install SingleStore either on-premises or in your own self-managed cloud. For Windows or Mac OS, you can use the SingleStore Quickstart Docker image, which is a standalone edition of SingleStore. If using Windows 10, consider installing Bash using the Windows Subsystem for Linux Beta. This beta feature will install Ubuntu on Windows, and you can run SingleStore experimentally in this environment. Very cool.
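Once your cluster is up, you can confirm its topology from any SQL client. SHOW AGGREGATORS and SHOW LEAVES list the aggregator and leaf nodes the cluster knows about; a minimal check looks like this:
SHOW AGGREGATORS;
SHOW LEAVES;
If each returns at least one node, the cluster is ready for the steps that follow.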
NYC TLC Trip Data
The NYC TLC data files are as small as 300MB and as large as 2.5GB. With 8 years of monthly data, yellow taxi information accounts for about 90% of all data. Many yellow taxi cab files are approximately 2.4GB. When compressed, the files range from 300MB to almost 600MB.
Download, Compress, and Upload the Data
Here’s a rudimentary Bash script to do the work of downloading each file, compressing the raw file, uploading the compressed file to the proper folder in S3, and then deleting both the raw CSV and tar.gz files from the machine where the script runs. The shell script requires the AWS CLI.
#!/bin/bash
# Change S3_BUCKET to your S3 bucket
export S3_BUCKET="s3://my-s3-bucket"
# Change S3_BASE_FOLDER to your base folder; keep nyc-taxi if you are following the tutorial
export S3_BASE_FOLDER="nyc-taxi"
# Change AWS_ACCESS_KEY_ID to your access key
export AWS_ACCESS_KEY_ID=MY_ACCESS_KEY
# Change AWS_SECRET_ACCESS_KEY to your secret key
export AWS_SECRET_ACCESS_KEY=MY_SECRET_KEY
# Change AWS_DEFAULT_PROFILE to your AWS CLI profile
export AWS_DEFAULT_PROFILE=my-aws-profile-user
export AWS_DEFAULT_REGION=us-east-1
aws configure --profile $AWS_DEFAULT_PROFILE
# Do not change URL_ROOT
URL_ROOT="https://s3.amazonaws.com/nyc-tlc/trip+data/"
# Modify if needed for smaller subsets, or add green and fhv for additional cab types
MONTH_ORDINALS=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12")
YEAR_ORDINALS=("2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016")
CAB_TYPES=("yellow")
# Leave as empty
FILE_NAME=""
S3_FOLDER=""
S3_SUBFOLDER=""

for name in "${CAB_TYPES[@]}"; do
  if [ "$name" == "yellow" ]; then
    S3_FOLDER="yellow_tripdata"
    YEARS=("${YEAR_ORDINALS[@]}")
  fi
  for yy in "${YEARS[@]}"; do
    for mm in "${MONTH_ORDINALS[@]}"; do
      FILE_NAME="${name}_tripdata_${yy}-${mm}"
      # Get the csv file
      curl -S -O "${URL_ROOT}${FILE_NAME}.csv" && echo "done! curl ${FILE_NAME}.csv" &
      wait
      # Tarball the file
      tar -cvzf "${FILE_NAME}.tar.gz" "${FILE_NAME}.csv" && echo "done! tar ${FILE_NAME}.tar.gz" &
      wait
      # Pick the S3 subfolder that matches the yellow taxi schema period
      if [[ "$name" == "yellow" && "$yy" == "2015" ]]; then
        S3_SUBFOLDER="201501-201606"
      elif [[ "$name" == "yellow" && "$yy" == "2016" && "$mm" < "07" ]]; then
        S3_SUBFOLDER="201501-201606"
      elif [[ "$name" == "yellow" && "$yy" == "2016" ]]; then
        S3_SUBFOLDER="201607-201612"
      else
        S3_SUBFOLDER="200901-201412"
      fi
      # Upload the gz file to AWS S3
      if [ "$name" == "yellow" ]; then
        aws s3 cp "${FILE_NAME}.tar.gz" "${S3_BUCKET}/${S3_BASE_FOLDER}/${S3_FOLDER}/${S3_SUBFOLDER}/" --profile $AWS_DEFAULT_PROFILE && echo "done! aws s3 cp ${FILE_NAME}.tar.gz" &
      fi
      wait
      # Remove the csv file
      rm -f "${FILE_NAME}.csv" && echo "done! rm -f ${FILE_NAME}.csv" &
      wait
      # Remove the gz file
      rm -f "${FILE_NAME}.tar.gz" && echo "done! rm -f ${FILE_NAME}.tar.gz" &
      wait
    done
  done
done
In order for the script to work with your S3 bucket and folder, specify values for these variables:
- S3_BUCKET
- S3_BASE_FOLDER
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
Save your changes and name the file nyc_tlc_taxi_files.sh. In the terminal where you are going to run the shell script, make the file executable:
chmod 777 nyc_tlc_taxi_files.sh
Next, create an AWS profile using the AWS CLI:
AWS_DEFAULT_PROFILE=my-profile
aws configure --profile $AWS_DEFAULT_PROFILE
At the prompts, specify your values for the ‘AWS Access Key ID’ and ‘AWS Secret Access Key’. For ‘Default region name’, specify us-east-1. For ‘Default output format’, specify json.
From a Bash shell, run the nyc_tlc_taxi_files.sh script. There is no need to specify the AWS profile settings again, so press enter at each prompt. The script outputs the successful processing of each file:
HAL2001:shellscripts Seth$ ./nyc_tlc_taxi_files.sh
AWS Access Key ID [****************7G6Q]:
AWS Secret Access Key [****************EmaN]:
Default region name [us-east-1]:
Default output format [json]:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2420M 100 2420M 0 0 4398k 0 0:09:23 0:09:23 --:--:-- 5915k
done! curl yellow_tripdata_2009-01.csv
a yellow_tripdata_2009-01.csv
done! tar yellow_tripdata_2009-01.tar.gz
upload: ./yellow_tripdata_2009-01.tar.gz to s3://my-s3-bucket/my-nyc-taxi-folder/yellow_tripdata/200901-201412/yellow_tripdata_2009-01.tar.gz
done! aws s3 cp yellow_tripdata_2009-01.tar.gz
done! rm -f yellow_tripdata_2009-01.csv
done! rm -f yellow_tripdata_2009-01.tar.gz
Downloading, compressing, and uploading the 96 yellow taxi monthly data files from your local machine will take around six to eleven hours, depending on your connection speed. If you are following along on a laptop, consider running the Bash script overnight with the machine plugged in and sleep disabled.
To Compress or Not Compress?
With AWS S3, you ultimately pay for storage size and for data transmission. Depending on the region, storing 200GB costs on average about five dollars a month; compression brings that down to under 50GB. Because a SingleStore Pipeline for S3 can process compressed files directly, you will save by using compression for large data sets on AWS S3.
Load the NYC Yellow Taxi Trip Data
Once the Bash script completes and all of the NYC taxi trip data has been uploaded to your S3 bucket folders, we can load it all using SingleStore S3 Pipelines.
SingleStore Database and Tables
To get started, we’ll create a database and two tables. Because the yellow taxi data has several schema changes over eight years as described in the data dictionary, we’ll create a staging table to handle the various data formats. After the initial S3 Pipeline data ingest, we’ll insert the staging data into a destination table.
Using a MySQL compatible client, connect to your master aggregator to run DDL commands. To create the database and tables, execute the following SQL:
DROP DATABASE IF EXISTS nyc_taxi;
CREATE DATABASE IF NOT EXISTS nyc_taxi;
USE nyc_taxi;
DROP TABLE IF EXISTS yellow_trips_staging;
CREATE TABLE IF NOT EXISTS yellow_trips_staging (
vendor_id varchar(3) NOT NULL DEFAULT '0',
pickup_datetime datetime NOT NULL,
dropoff_datetime datetime NOT NULL,
pickup_longitude DOUBLE NOT NULL DEFAULT 0,
pickup_latitude DOUBLE NOT NULL DEFAULT 0,
passenger_count tinyint NOT NULL DEFAULT 0,
trip_distance decimal(6,2) NOT NULL DEFAULT 0,
rate_code tinyint NOT NULL DEFAULT 0,
store_and_fwd_flag varchar(1) NOT NULL DEFAULT 'N',
dropoff_longitude DOUBLE NOT NULL DEFAULT 0,
dropoff_latitude DOUBLE NOT NULL DEFAULT 0,
PULocationID smallint(3) NOT NULL DEFAULT 0,
DOLocationID smallint(3) NOT NULL DEFAULT 0,
payment_type tinyint NOT NULL DEFAULT 0,
fare_amount decimal(6,2) NOT NULL DEFAULT 0,
surcharge decimal(6,2) NOT NULL DEFAULT 0,
extra decimal(6,2) NOT NULL DEFAULT 0,
mta_tax decimal(6,2) NOT NULL DEFAULT 0,
tip_amount decimal(6,2) NOT NULL DEFAULT 0,
tolls_amount decimal(6,2) NOT NULL DEFAULT 0,
improvement_surcharge decimal(6,2) NOT NULL DEFAULT 0,
total_amount decimal(7,2) NOT NULL DEFAULT 0,
colA varchar(1),
colB varchar(1),
key(pickup_datetime, dropoff_datetime) USING CLUSTERED COLUMNSTORE
);
DROP TABLE IF EXISTS yellow_trips;
CREATE TABLE IF NOT EXISTS yellow_trips (
vendor_id varchar(3) NOT NULL DEFAULT '0',
pickup_datetime datetime NOT NULL,
pickup_year smallint(4) NOT NULL DEFAULT 0,
pickup_month tinyint(2) NOT NULL DEFAULT 0,
pickup_day tinyint(2) NOT NULL DEFAULT 0,
pickup_week tinyint(2) NOT NULL DEFAULT 0,
pickup_time TIME NOT NULL DEFAULT '00:00:00',
dropoff_datetime datetime NOT NULL,
dropoff_year smallint(4) NOT NULL DEFAULT 0,
dropoff_month tinyint(2) NOT NULL DEFAULT 0,
dropoff_day tinyint(2) NOT NULL DEFAULT 0,
dropoff_week tinyint(2) NOT NULL DEFAULT 0,
dropoff_time TIME NOT NULL DEFAULT '00:00:00',
pickup_pt geographypoint not null default 'POINT(0 0)',
dropoff_pt geographypoint not null default 'POINT(0 0)',
PULocationID smallint(3) NOT NULL DEFAULT 0,
DOLocationID smallint(3) NOT NULL DEFAULT 0,
passenger_count tinyint(2) NOT NULL DEFAULT 0,
trip_distance decimal(6,2) NOT NULL DEFAULT 0,
rate_code tinyint NOT NULL DEFAULT 0,
store_and_fwd_flag varchar(1) NOT NULL DEFAULT 'N',
payment_type tinyint NOT NULL DEFAULT 0,
fare_amount decimal(6,2) NOT NULL DEFAULT 0,
surcharge decimal(6,2) NOT NULL DEFAULT 0,
extra decimal(6,2) NOT NULL DEFAULT 0,
mta_tax decimal(6,2) NOT NULL DEFAULT 0,
tip_amount decimal(6,2) NOT NULL DEFAULT 0,
tolls_amount decimal(6,2) NOT NULL DEFAULT 0,
improvement_surcharge decimal(6,2) NOT NULL DEFAULT 0,
total_amount decimal(7,2) NOT NULL DEFAULT 0,
shard key(pickup_datetime, dropoff_datetime),
key(pickup_datetime, dropoff_datetime) USING CLUSTERED COLUMNSTORE
);
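Before wiring up the pipelines, it’s worth a quick sanity check that both tables were created with the expected columns:
USE nyc_taxi;
SHOW TABLES;
DESCRIBE yellow_trips_staging;
DESCRIBE yellow_trips;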
SingleStore Pipelines
Declared ever so simply but ever so powerfully with a CREATE PIPELINE statement, SingleStore Pipelines are a native construct in SingleStore. The pipeline syntax lets us extract one or more gzip-compressed (or uncompressed) source files from an S3 bucket folder and load the data into a destination database table. In other words, we’ll end up with three SingleStore Pipelines to match our three S3 folders.
Create Pipelines
You’ll need to specify your AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID credentials in the following CREATE PIPELINE statements. The statements use SKIP ALL ERRORS to skip any CSV parsing errors. The three pipelines will load data rapidly and in parallel into the staging table:
USE nyc_taxi;
DROP PIPELINE IF EXISTS pipeline_yellow_200901_201412;
CREATE PIPELINE IF NOT EXISTS pipeline_yellow_200901_201412 AS
-- IMPORTANT: CHANGE my-s3-bucket
LOAD DATA S3 "my-s3-bucket/nyc-taxi/yellow_tripdata/200901-201412/"
-- IMPORTANT: CHANGE my_aws_secret_access_key and my_aws_access_key_id
CREDENTIALS '{"aws_secret_access_key": "my_aws_secret_access_key", "aws_access_key_id": "my_aws_access_key_id"} '
SKIP ALL ERRORS
INTO TABLE yellow_trips_staging
FIELDS TERMINATED BY ','
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount);
DROP PIPELINE IF EXISTS pipeline_yellow_201501_201606;
CREATE PIPELINE IF NOT EXISTS pipeline_yellow_201501_201606 AS
-- IMPORTANT: CHANGE my-s3-bucket
LOAD DATA S3 "my-s3-bucket/nyc-taxi/yellow_tripdata/201501-201606/"
-- IMPORTANT: CHANGE my_aws_secret_access_key and my_aws_access_key_id
CREDENTIALS '{"aws_secret_access_key": "my_aws_secret_access_key", "aws_access_key_id": "my_aws_access_key_id"} '
SKIP ALL ERRORS
INTO TABLE yellow_trips_staging
FIELDS TERMINATED BY ','
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount);
DROP PIPELINE IF EXISTS pipeline_yellow_201607_201612;
CREATE PIPELINE IF NOT EXISTS pipeline_yellow_201607_201612 AS
-- IMPORTANT: CHANGE my-s3-bucket
LOAD DATA S3 "my-s3-bucket/nyc-taxi/yellow_tripdata/201607-201612/"
-- IMPORTANT: CHANGE my_aws_secret_access_key and my_aws_access_key_id
CREDENTIALS '{"aws_secret_access_key": "my_aws_secret_access_key", "aws_access_key_id": "my_aws_access_key_id"} '
SKIP ALL ERRORS
INTO TABLE yellow_trips_staging
FIELDS TERMINATED BY ','
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,@,@);
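Before starting the pipelines, you can optionally dry-run one of them. TEST PIPELINE pulls a small sample through the extractor and parser without writing to the table, which is a quick way to confirm the bucket path and credentials are correct (the LIMIT value here is arbitrary):
USE nyc_taxi;
TEST PIPELINE pipeline_yellow_200901_201412 LIMIT 5;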
Now, we can start the pipelines with:
USE nyc_taxi;
START PIPELINE pipeline_yellow_200901_201412;
START PIPELINE pipeline_yellow_201501_201606;
START PIPELINE pipeline_yellow_201607_201612;
Verify that your pipelines are running.
USE nyc_taxi;
SHOW PIPELINES;
You can view the files in each batch, identified by BATCH_ID, with the following query:
USE nyc_taxi;
SELECT * FROM information_schema.PIPELINES_FILES WHERE DATABASE_NAME = 'nyc_taxi';
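While the load runs, you can also watch batches move through their states using the same information_schema table this tutorial queries later for timings (a simple monitoring query):
USE nyc_taxi;
SELECT BATCH_STATE, COUNT(DISTINCT BATCH_ID) AS batches
FROM information_schema.PIPELINES_BATCHES
WHERE DATABASE_NAME = 'nyc_taxi'
GROUP BY BATCH_STATE;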
Parallel Loading with SingleStore Pipelines
To determine the number of files in a pipeline batch, SingleStore uses the database partition count.
SHOW PARTITIONS ON nyc_taxi;
The distinct ordinal count of master partitions reflects the total number of database partitions.
Load Skew
Smaller files in a pipeline batch load faster than larger ones, and the next pipeline batch begins only after every file in the current batch has loaded successfully in parallel. For the NYC yellow taxi trip data set, the monthly files differ in size by as much as 210MB, so there will be a degree of load skew. The pipeline with the most files to process is pipeline_yellow_200901_201412: it covers six years of monthly data, or 72 files. A database with 32 partitions will create three pipeline batches for a 72-file pipeline:
SELECT PIPELINE_NAME, BATCH_ID, count(*) as File_Count
FROM information_schema.PIPELINES_FILES
WHERE DATABASE_NAME = 'nyc_taxi'
GROUP BY PIPELINE_NAME, BATCH_ID
ORDER BY PIPELINE_NAME, BATCH_ID;
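The three-batch figure follows directly from the arithmetic: with 32 partitions, a 72-file pipeline needs ceiling(72 / 32) = 3 batches. You can confirm the calculation with plain SQL (32 matches this tutorial’s cluster; substitute your own partition count):
SELECT CEILING(72 / 32) AS expected_batches;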
To view the overall batch time in minutes, the total number of rows, and rows per second, execute the following query:
SELECT Round((SUM(v1.BATCH_TIME) / 60),2) as Total_Minutes, FORMAT(SUM(v1.BATCH_ROWS_WRITTEN),0) as Total_Rows
,FORMAT(SUM(v1.BATCH_ROWS_WRITTEN)/ (SUM(v1.BATCH_TIME)),3) as Rows_Per_Sec
FROM (
SELECT DISTINCT t1.BATCH_ID, t1.BATCH_TIME, t1.BATCH_ROWS_WRITTEN
FROM INFORMATION_SCHEMA.PIPELINES_BATCHES as t1
WHERE t1.BATCH_STATE = 'Succeeded' AND t1.DATABASE_NAME = 'nyc_taxi'
) as v1;
The results for loading all of the yellow taxi data into a large SingleStore cluster are:
- Total_Minutes = 23.94
- Total_Rows = 1,308,985,065
- Rows_Per_Sec = 911,305.258
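To cross-check the Total_Rows figure, you can count the rows that landed in the staging table directly:
USE nyc_taxi;
SELECT FORMAT(COUNT(*), 0) AS staged_rows FROM yellow_trips_staging;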
Now that the load is complete, we can stop the pipelines.
USE nyc_taxi;
STOP PIPELINE pipeline_yellow_200901_201412;
STOP PIPELINE pipeline_yellow_201501_201606;
STOP PIPELINE pipeline_yellow_201607_201612;
SHOW PIPELINES;
From Staging to Destination
With 1.3 billion rows, there will be a few load errors. We can view the files with the most errors using the following query:
SELECT v1.BATCH_SOURCE_PARTITION_ID, v1.ERROR_CODE, COUNT(v1.ERROR_ID) as Total_Errors
FROM (
SELECT BATCH_SOURCE_PARTITION_ID,ERROR_CODE,ERROR_ID
FROM information_schema.PIPELINES_ERRORS
) as v1
GROUP BY v1.BATCH_SOURCE_PARTITION_ID, v1.ERROR_CODE
ORDER BY total_errors DESC;
There are two files with around 650 row-related errors out of 1.308 billion rows. Of course, there are also a few issues with the data itself: rows with 0000-00-00 00:00:00 datetimes and rows with meaningless longitudes and latitudes. We can clean these up with some basic DELETE and UPDATE statements that take less than 20 seconds to complete.
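If you want to gauge how many rows the cleanup will touch before changing anything, a couple of quick counts help (an optional check that mirrors the conditions used below):
USE nyc_taxi;
SELECT COUNT(*) AS zero_datetime_rows
FROM yellow_trips_staging
WHERE pickup_datetime = '0000-00-00 00:00:00' AND dropoff_datetime = '0000-00-00 00:00:00';
SELECT COUNT(*) AS out_of_range_pickups
FROM yellow_trips_staging
WHERE pickup_longitude NOT BETWEEN -180.00 AND 180.00 OR pickup_latitude NOT BETWEEN -90.00 AND 90.00;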
USE nyc_taxi;
DELETE FROM yellow_trips_staging WHERE pickup_datetime = '0000-00-00 00:00:00' AND dropoff_datetime = '0000-00-00 00:00:00';
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude < -180.00 OR pickup_latitude > 90.00);
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude > 180.00 OR pickup_latitude < -90.00);
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude = 180.00 OR pickup_latitude = 90.00);
UPDATE yellow_trips_staging SET pickup_longitude = 0.0, pickup_latitude = 0.0 WHERE (pickup_longitude = -180.00 OR pickup_latitude = -90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude < -180.00 OR dropoff_latitude > 90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude > 180.00 OR dropoff_latitude < -90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude = 180.00 OR dropoff_latitude = 90.00);
UPDATE yellow_trips_staging SET dropoff_longitude = 0.0, dropoff_latitude = 0.0 WHERE (dropoff_longitude = -180.00 OR dropoff_latitude = -90.00);
With the data cleaned up, we are ready to copy it into our destination table. As you may recall, the DDL for the yellow_trips table stores the pickup and dropoff points as the SingleStore geographypoint data type, enabling us to write fast geospatial queries moving forward.
USE nyc_taxi;
INSERT INTO yellow_trips
SELECT vendor_id, pickup_datetime,year(pickup_datetime) as pickup_year, month(pickup_datetime) as pickup_month, day(pickup_datetime) as pickup_day
, week(pickup_datetime) as pickup_week, time(pickup_datetime) as pickup_time, dropoff_datetime, year(dropoff_datetime) as dropoff_year
, month(dropoff_datetime) as dropoff_month, day(dropoff_datetime) as dropoff_day, week(dropoff_datetime) as dropoff_week, time(dropoff_datetime) as dropoff_time
, GEOGRAPHY_POINT(pickup_longitude , pickup_latitude) as pickup_pt, GEOGRAPHY_POINT(dropoff_longitude, dropoff_latitude) as dropoff_pt
, PULocationID, DOLocationID, passenger_count, trip_distance, rate_code, store_and_fwd_flag, payment_type, fare_amount, surcharge, extra, mta_tax
, tip_amount , tolls_amount, improvement_surcharge, total_amount
FROM yellow_trips_staging;
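A quick count against the destination table confirms the copy matches the staging row count:
SELECT FORMAT(COUNT(*), 0) AS yellow_trips_rows FROM yellow_trips;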
Finally, we’ll clean up the staging table and update the statistics for the yellow_trips table.
USE nyc_taxi;
TRUNCATE TABLE yellow_trips_staging;
ANALYZE TABLE yellow_trips;
OPTIMIZE TABLE yellow_trips;
Now, we’re ready to start analyzing the data!
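As a first taste of what the geographypoint columns enable, here is an illustrative geospatial query that counts pickups within roughly 500 meters of Times Square by year. GEOGRAPHY_WITHIN_DISTANCE takes two geography values and a distance in meters; the coordinates and radius here are just example values:
USE nyc_taxi;
SELECT pickup_year, FORMAT(COUNT(*), 0) AS trips
FROM yellow_trips
WHERE GEOGRAPHY_WITHIN_DISTANCE(pickup_pt, 'POINT(-73.9855 40.7580)', 500)
GROUP BY pickup_year
ORDER BY pickup_year;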
Data Science Inspiration
For those looking for inspiration about what and how to analyze the NYC yellow taxi data, take a look at Todd W. Schneider’s now famous blog “Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance”. As Todd writes, “taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York.”
Next Steps
The SingleStore story is one of relentless innovation and brilliant engineering.
Just five years ago, SingleStore posted “Loading half a billion records in 40 minutes”. That blog details how to chunk up manufactured data into 2000 equally sized files in order to load 500 million rows into SingleStore on a machine with 512GB memory and 64 cores.
To preview the next chapter in the amazing SingleStore story that includes an 80X improvement in columnstore query execution and AI similarity matching with SIMD functions, sign up for the SingleStore Beta.