Telecom Fraud Detection Through CDR Using SingleStore Vector Functions

AI offers a dynamic and sophisticated means to detect and prevent telecom anomalies, which often involves identifying unusual patterns in data. Vectorization makes the database detecting these anomalies more efficient, executing the complex queries needed for pattern recognition in some fraud cases. SingleStore can play a vital role in setting pattern recognition using its vector capabilities.

What is SingleStore?

SingleStore is a real-time, distributed SQL database built for AI. With familiar SQL tooling and MySQL wire protocol compatibility, SingleStore eliminates the need for specialized databases and simplifies data architectures.

SingleStore is also built to handle multiple data types — including JSON, time-series, geospatial, full-text search and vectors — delivering high-speed data ingestion on a unified transactional and analytical foundation.

Vector capabilities of SingleStore

SingleStore supports vector database processing, allowing you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc., and vector database searches find data based on content or meaning — even without exact matches. For example, vector search allows for semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.

Some benefits of using SingleStore for vector database processing — as opposed to a specialized vector database system — are:

A broad array of standard modern database capabilities are available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support and more.
Less data movement is needed between different data subsystems (e.g., caches, text search systems and SQL databases) when all t data, including vector data, is stored in SingleStore.
Operational costs may be reduced since fewer data management tools and copies of data are needed.
Less specialized skills and reduced labor are needed to run an application environment.

Anomalies in telecom CDR

Anomalies in telecom CDR (Call Detail Record) refer to irregular or unusual patterns in the data that records details of phone calls, SMS messages or data usage in a telecommunications network. A report from the Communication Fraud Control Association (CFCA) reveals that telecommunications fraud increased 12% in 2023, equating to an estimated $38.95 billion lost to fraud.

**Source:** *Communication Fraud Control Association, Nov. 2023*

Anomalies in CDR records are a good source to detect frauds in telecommunication.
We will address the following frauds through vector capabilities of SingleStore in telecom CDR records:

A. International revenue share fraud

B. Wangiri fraud

C. Subscription fraud

D. PBX hacking

Lets understand these frauds through sample CDR records

Explanation:

International revenue share fraud (IRSF):

Example: Call ID 1
Signs: The call is unusually long (1 hour) to an international number and incurs a high cost. IRSF typically involves making long calls to premium-rate international numbers owned by the fraudster.
Detection: Look for long-duration calls to known high-cost international destinations.

Wangiri fraud:

Example: Call ID 2
Signs: The call has a very short duration (10 seconds) to an international number, which is a common sign of Wangiri fraud where the fraudster expects a callback.
Detection: Track short missed or very short duration calls to international numbers, followed by callbacks.

Subscription fraud:

Example: Call ID 3
Signs: This is harder to detect directly from a CDR. However, if this number is new, recently activated and immediately starts making expensive international calls, it could be a sign
Detection: Monitor new accounts for unusual calling patterns, especially to high-cost destinations.

PBX hacking:

Example: Call ID 5
Signs: The caller number is an extension (e.g., 1234), indicating a call made from a PBX system, and it's a long-duration international call. Hackers often break into PBX systems to make long or numerous international calls.
Detection: Look for unusual call patterns from PBX extensions, like calls at odd hours, to international destinations or of unusually long duration.

Detecting anomalies for these frauds using SingleStore vector capabilities

Step 1: Setting up the SingleStore database environment

1.1. Create the database and table to store Telecom CDR

1
drop database IF EXISTS cdranomalies;
2
create database cdranomalies;
3
use cdranomalies;
4
drop table if EXISTS telecom_cdr;
5
create table telecom_cdr(Call_ID bigint auto_increment not null primary key,
6
Caller_Number varchar(50),
7
Callee_Number varchar(50),
8
Start_Time timestamp,
9
Duration bigint,
10
Cost float,
11
Call_Type varchar(50),
12
Usage_Type varchar(50),
13
Embeddings blob,
14
Anomaly varchar(100));

1.2. Create and start the pipeline to ingest records into the table

CREATE OR REPLACE  AGGREGATOR PIPELINE telecomcdr_pipeline AS

LOAD DATA S3 's3://gpsteam/telecomcdr/combined_cdr_data.csv'

CONFIG '{"region":"us-east-1"}'

CREDENTIALS.                     '{"aws_access_key_id":                  "","aws_secret_access_key": 

"","aws_session_token": ""}'

SKIP DUPLICATE KEY ERRORS

INTO TABLE telecom_cdr 

FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'

LINES TERMINATED BY '\n'

IGNORE                                                                                                                                    1

LINES(Caller_Number,Callee_Number,Start_Time,Duration,Cost,Call_Type,Usage_Type;

START PIPELINE telecomcdr_pipeline;

1.3. Stored procedure to detect and mark anomalies for CDR records in the table

1
CREATE OR REPLACE PROCEDURE detectanomaly() AS
2
BEGIN
3
    UPDATE telecom_cdr SET Anomaly = CASE    
4
WHEN 
5
dot_product(Embeddings,JSON_ARRAY_PACK("[0.006190149579197168,........,-0.0149
6
34113249182701]")) >= 0.981 THEN 'IRSF'
7
WHEN 
8
dot_product(Embeddings,JSON_ARRAY_PACK("[-0.006254644598811865,.........,-0.026
9
534441858530045]")) >= 0.981 THEN 'Wangiri Fraud'
10
WHEN 
11
dot_product(Embeddings,JSON_ARRAY_PACK("[0.0009138236637227237,.........,-0.018
12
494699150323868]")) >= 0.981 THEN 'PBX Hacking'
13
        ELSE NULL
14
    END;
15
END;

Step 2: Setting up the environment and writing the Python program to generate vector embeddings, marking anomalies for different frauds and its visual representation over the time

2.1. Environment setup for Python program

!pip install pymysql

!pip install panda

!pip install openai

!pip install singlestoredb

!pip install sqlalchemy

!pip install mysql-connector-python

!pip install matplotlib

!pip install plotly

2.2. Actual Python program

1
import pandas as pd
2
import sqlalchemy
3
import openai
4
from singlestoredb import connect
5
from sqlalchemy import create_engine,text
6
import matplotlib.pyplot as plt
7
from matplotlib.dates import DateFormatter
8
import matplotlib.dates as mdates
9
import numpy as np
10
import datetime
11
import plotly.express as px
12
import plotly.graph_objs as go
13
# OpenAI API key
14
openai.api_key = 
15
'sk-DXwarIh88HeIexe7VnwzX3BlbkFJOZim8Mot9Y2YHupF2n2x'
16

17
# Create a connection to the database
18
engine = create_engine(connection_url)
19

20
# Query the database to get the data from 'telecom_cdr' table
21
#df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type` FROM 
22
telecom_cdr', engine)
23
df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type`, 
24
`Start_Time`, `Anomaly` FROM telecom_cdr', engine)
25
# Function to generate embeddings
26
def generate_embeddings(row):
27
    text = f"Duration: {row['Duration']}, Cost: {row['Cost']}, Call Type: 
28
{row['Call_Type']}"
29
    response = openai.embeddings.create(input=[text],
30
model="text-embedding-ada-002")
31
    return response.data[0].embedding
32

33
# Apply the function to each row
34
df['Embeddings'] = df.apply(generate_embeddings, axis=1)
35
with engine.connect() as connection:
36
# Update the database with the new embeddings
37
    for index, row in df.iterrows():
38
        update_query = f"""UPDATE telecom_cdr SET `Embeddings` = 
39
JSON_ARRAY_PACK(:embeddings) WHERE Call_ID = :call_id"""
40
        connection.execute(text(update_query), {"embeddings": 
41
str(row['Embeddings']), "call_id": row['Call_ID']})
42
with engine.connect() as connection1:
43
    update_query1 = f"""CALL detectanomaly();"""
44
    connection1.execute(text(update_query1))
45

46
anomaly_query = """
47
SELECT 
48
    DATE_FORMAT(Start_Time, '%Y-%m') AS Month, 
49
    Anomaly, 
50
    COUNT(*) AS Anomaly_Count 
51
FROM 
52
    telecom_cdr 
53
GROUP BY 
54
    Month, Anomaly
55
"""
56
anomaly_df = pd.read_sql(anomaly_query, engine)
57

58
# Pivot the dataframe to get a column for each anomaly type
59
pivot_df = anomaly_df.pivot(index='Month', columns='Anomaly', 
60
values='Anomaly_Count').fillna(0)
61

62
# Reset index to make 'Month' a column again
63
pivot_df.reset_index(inplace=True)
64

65
# Convert the pivoted dataframe into a format suitable for Plotly
66
plotly_data = pivot_df.melt(id_vars=['Month'], var_name='Anomaly', 
67
value_name='Anomaly_Count')
68

69
# Ensure the 'Month' column is in datetime format for proper plotting
70
plotly_data['Month'] = pd.to_datetime(plotly_data['Month'])
71

72
# Create the line plot using Plotly
73
fig = px.line(plotly_data, 
74
              x='Month', 
75
              y='Anomaly_Count', 
76
              color='Anomaly',
77
              title='Anomaly Counts Over Time',
78
              labels={'Anomaly_Count': 'Anomaly Count', 'Month': 'Month'})
79

80
# Update layout for better readability
81
fig.update_layout(
82
    xaxis_title='Month',
83
    yaxis_title='Anomaly Count',
84
    legend_title='Anomaly Types',
85
    xaxis_tickangle=-45
86
)
87

88
# Show the plot
89
fig.show()# Close the database connection
90
engine.dispose()
91

2.3. Visual graph for anomaly detection.

3. Here are the steps demonstrating how how SingleStore works in the preceding Python program to detect anomalies for fraud detection

3.1. Importing required packages and defining OpenAI key

3.2. Create the database connection and query the database to get data from the table to mark anomalies.

3.3. Defining a function to generate embeddings using the OpenAI method text-embedding-ada-002 model.

3.4. Apply the function “generate_embeddings” to generate embeddings for each row. In this section I am also running one database stored procedure “detectanomaly” to mark an anomaly to each record if the row satisfies conditions mentioned in the preceding Explanation¹ section.

How to mark anomalies for CDR

First, generate a base embedding for each condition mentioned in the Explanation¹section. These base embeddings help mark anomalies for each record near the base embeddings using the SingleStore DOT_PRODUCT vector function. In my case, I have considered the duration, cost and call_type column to generate base embeddings. You can also include other factors to generate this base embedding.

In the second step, get the embeddings of each CDR using the function “generate_embeddings”. This function is used in the previous step to generate embeddings for each CDR.

Finally, it's time to mark anomalies for each CDR matching or near the base embeddings. For this I have used one SingleStore stored procedure, “detectanomaly” which updates each the CDR with particular anomalies including 'IRSF,' 'Wangiri fraud' and 'PBX hacking'.

4. Criteria to mark anomalies

The stored procedure “detectanomaly” calculates whether each record is anomalous based on the SingleStore DOT_PRODUCT function of its embeddings with predefined vectors, each representing a specific type of fraud. If a certain threshold is met or exceeded, the record is flagged with the corresponding type of fraud.

DOT_PRODUCT() can be used to compute a cosine similarity metric of the two input vectors, if the input vectors are normalized to length 1.

In the stored procedure I have set the threshold 0.981 to mark anomalies of similar type based on a criteria of duration, cost and call_type — but you can set your own threshold to mark anomalies based on specific requirements.