AI offers a dynamic and sophisticated means to detect and prevent telecom anomalies, a task that largely comes down to identifying unusual patterns in data. Vectorization makes the database that detects these anomalies more efficient, speeding up the complex queries needed for pattern recognition in fraud cases. SingleStore, with its built-in vector capabilities, can play a vital role in this kind of pattern recognition.
What is SingleStore?
SingleStore is a real-time, distributed SQL database built for AI. With familiar SQL tooling and MySQL wire protocol compatibility, SingleStore eliminates the need for specialized databases and simplifies data architectures.
SingleStore is also built to handle multiple data types — including JSON, time-series, geospatial, full-text search and vectors — delivering high-speed data ingestion on a unified transactional and analytical foundation.
Vector capabilities of SingleStore
SingleStore supports vector database processing, allowing you to store and search vector data. A typical vector search locates the set of vectors that most closely match a query vector. Vectors usually come from objects: text, images, video, audio, etc., and vector database searches find data based on content or meaning — even without exact matches. For example, vector search allows for semantic search of text, where a query about "meals" could return information about "lunch" and "dinner" without using those words because they are similar in meaning.
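To make this concrete, here is a minimal sketch of what such a semantic search could look like against SingleStore, using the same DOT_PRODUCT and JSON_ARRAY_PACK functions applied later in this article. The documents table, its columns and the connection string are illustrative assumptions, not part of the CDR schema used below.

# Minimal semantic-search sketch against SingleStore.
# The `documents` table, its columns and the connection URL are assumptions for illustration.
import openai
from sqlalchemy import create_engine, text

openai.api_key = "<OPENAI_API_KEY>"
engine = create_engine("mysql+pymysql://<user>:<password>@<host>:3306/demo")

def embed(query: str) -> list:
    # Embed the query with the same model used to embed the stored documents
    response = openai.embeddings.create(input=[query], model="text-embedding-ada-002")
    return response.data[0].embedding

query_vector = embed("meals")

# DOT_PRODUCT over normalized embeddings behaves like cosine similarity,
# so the highest-scoring rows are the closest in meaning to the query
# (for example, rows about "lunch" or "dinner").
sql = text("""
    SELECT content,
           DOT_PRODUCT(embedding, JSON_ARRAY_PACK(:qv)) AS score
    FROM documents
    ORDER BY score DESC
    LIMIT 5
""")

with engine.connect() as conn:
    for content, score in conn.execute(sql, {"qv": str(query_vector)}):
        print(round(score, 3), content)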
Some benefits of using SingleStore for vector database processing — as opposed to a specialized vector database system — are:
- A broad array of standard modern database capabilities are available in SingleStore. These include SQL, fast distributed and parallel query processing, full-text search, extensibility, ACID transactions, high availability, disaster recovery, point-in-time recovery, broad connectivity support and more.
- Less data movement is needed between different data subsystems (e.g., caches, text search systems and SQL databases) when all the data, including vector data, is stored in SingleStore.
- Operational costs may be reduced since fewer data management tools and copies of data are needed.
- Less specialized skills and reduced labor are needed to run an application environment.
Anomalies in telecom CDR
Anomalies in telecom CDR (Call Detail Record) refer to irregular or unusual patterns in the data that records details of phone calls, SMS messages or data usage in a telecommunications network. A report from the Communication Fraud Control Association (CFCA) reveals that telecommunications fraud increased 12% in 2023, equating to an estimated $38.95 billion lost to fraud.
Anomalies in CDR records are a good source for detecting fraud in telecommunications.
We will address the following types of fraud in telecom CDR records using the vector capabilities of SingleStore:
A. International revenue share fraud
B. Wangiri fraud
C. Subscription fraud
D. PBX hacking
Let's understand these frauds through sample CDR records.
Explanation:
International revenue share fraud (IRSF):
- Example: Call ID 1
- Signs: The call is unusually long (1 hour) to an international number and incurs a high cost. IRSF typically involves making long calls to premium-rate international numbers owned by the fraudster.
- Detection: Look for long-duration calls to known high-cost international destinations.
Wangiri fraud:
- Example: Call ID 2
- Signs: The call has a very short duration (10 seconds) to an international number, which is a common sign of Wangiri fraud where the fraudster expects a callback.
- Detection: Track short missed or very short duration calls to international numbers, followed by callbacks.
Subscription fraud:
- Example: Call ID 3
- Signs: This is harder to detect directly from a CDR. However, if this number is new, recently activated and immediately starts making expensive international calls, it could be a sign of subscription fraud.
- Detection: Monitor new accounts for unusual calling patterns, especially to high-cost destinations.
PBX hacking:
- Example: Call ID 5
- Signs: The caller number is an extension (e.g., 1234), indicating a call made from a PBX system, and it's a long-duration international call. Hackers often break into PBX systems to make long or numerous international calls.
- Detection: Look for unusual call patterns from PBX extensions, like calls at odd hours, to international destinations or of unusually long duration (see the rule-based sketch below).
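Before moving to vector-based detection, these heuristics can also be approximated with plain SQL rules against the telecom_cdr table created in Step 1 below. The numeric cut-offs and the Call_Type value are illustrative assumptions; subscription fraud is omitted because, as noted above, it requires account-age data that a single CDR does not carry.

# Rule-of-thumb baseline for the heuristics above (thresholds and Call_Type values are assumed).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://<user>:<password>@<host>:3306/cdranomalies")  # placeholder URL

rule_based_sql = """
    SELECT * FROM (
        SELECT Call_ID, Caller_Number, Callee_Number, Duration, Cost, Call_Type,
               CASE
                   -- long, expensive international calls: possible IRSF
                   WHEN Call_Type = 'International' AND Duration >= 3600 AND Cost >= 50 THEN 'Possible IRSF'
                   -- very short international calls: possible Wangiri bait calls
                   WHEN Call_Type = 'International' AND Duration <= 15 THEN 'Possible Wangiri'
                   -- long international calls from short PBX extensions: possible PBX hacking
                   WHEN Call_Type = 'International' AND LENGTH(Caller_Number) <= 4 AND Duration >= 1800 THEN 'Possible PBX hacking'
                   ELSE NULL
               END AS Rule_Flag
        FROM telecom_cdr
    ) flagged
    WHERE Rule_Flag IS NOT NULL
"""

suspicious = pd.read_sql(rule_based_sql, engine)
print(suspicious.head())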
Detecting anomalies for these frauds using SingleStore vector capabilities
Step 1: Setting up the SingleStore database environment
1.1. Create the database and table to store Telecom CDR
DROP DATABASE IF EXISTS cdranomalies;
CREATE DATABASE cdranomalies;
USE cdranomalies;

DROP TABLE IF EXISTS telecom_cdr;
CREATE TABLE telecom_cdr (
    Call_ID bigint AUTO_INCREMENT NOT NULL PRIMARY KEY,
    Caller_Number varchar(50),
    Callee_Number varchar(50),
    Start_Time timestamp,
    Duration bigint,
    Cost float,
    Call_Type varchar(50),
    Usage_Type varchar(50),
    Embeddings blob,
    Anomaly varchar(100)
);
1.2. Create and start the pipeline to ingest records into the table
CREATE OR REPLACE AGGREGATOR PIPELINE telecomcdr_pipeline AS
LOAD DATA S3 's3://gpsteam/telecomcdr/combined_cdr_data.csv'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "", "aws_secret_access_key": "", "aws_session_token": ""}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE telecom_cdr
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(Caller_Number, Callee_Number, Start_Time, Duration, Cost, Call_Type, Usage_Type);

START PIPELINE telecomcdr_pipeline;
1.3. Stored procedure to detect and mark anomalies for CDR records in the table
CREATE OR REPLACE PROCEDURE detectanomaly() AS
BEGIN
    UPDATE telecom_cdr SET Anomaly = CASE
        WHEN DOT_PRODUCT(Embeddings, JSON_ARRAY_PACK("[0.006190149579197168,........,-0.014934113249182701]")) >= 0.981 THEN 'IRSF'
        WHEN DOT_PRODUCT(Embeddings, JSON_ARRAY_PACK("[-0.006254644598811865,.........,-0.026534441858530045]")) >= 0.981 THEN 'Wangiri Fraud'
        WHEN DOT_PRODUCT(Embeddings, JSON_ARRAY_PACK("[0.0009138236637227237,.........,-0.018494699150323868]")) >= 0.981 THEN 'PBX Hacking'
        ELSE NULL
    END;
END;
Step 2: Setting up the environment and writing the Python program to generate vector embeddings, mark anomalies for the different frauds and visualize them over time
2.1. Environment setup for Python program
!pip install pymysql
!pip install pandas
!pip install openai
!pip install singlestoredb
!pip install sqlalchemy
!pip install mysql-connector-python
!pip install matplotlib
!pip install plotly
2.2. Actual Python program
import pandas as pd
import sqlalchemy
import openai
from singlestoredb import connect
from sqlalchemy import create_engine, text
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import numpy as np
import datetime
import plotly.express as px
import plotly.graph_objs as go

# OpenAI API key (use your own key; never publish it in source code)
openai.api_key = '<OPENAI_API_KEY>'

# Connection URL for the SingleStore workspace (placeholder values)
connection_url = 'mysql+pymysql://<user>:<password>@<host>:3306/cdranomalies'

# Create a connection to the database
engine = create_engine(connection_url)

# Query the database to get the data from the 'telecom_cdr' table
# df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type` FROM telecom_cdr', engine)
df = pd.read_sql('SELECT `Call_ID`, `Duration`, `Cost`, `Call_Type`, `Start_Time`, `Anomaly` FROM telecom_cdr', engine)

# Function to generate embeddings
def generate_embeddings(row):
    description = f"Duration: {row['Duration']}, Cost: {row['Cost']}, Call Type: {row['Call_Type']}"
    response = openai.embeddings.create(input=[description], model="text-embedding-ada-002")
    return response.data[0].embedding

# Apply the function to each row
df['Embeddings'] = df.apply(generate_embeddings, axis=1)

# Update the database with the new embeddings (engine.begin() commits when the block exits)
with engine.begin() as connection:
    for index, row in df.iterrows():
        update_query = """UPDATE telecom_cdr SET `Embeddings` = JSON_ARRAY_PACK(:embeddings) WHERE Call_ID = :call_id"""
        connection.execute(text(update_query), {"embeddings": str(row['Embeddings']), "call_id": row['Call_ID']})

# Mark anomalies by calling the stored procedure defined in Step 1.3
with engine.begin() as connection1:
    connection1.execute(text("CALL detectanomaly();"))

# Aggregate anomaly counts per month
anomaly_query = """
SELECT
    DATE_FORMAT(Start_Time, '%Y-%m') AS Month,
    Anomaly,
    COUNT(*) AS Anomaly_Count
FROM telecom_cdr
GROUP BY Month, Anomaly
"""
anomaly_df = pd.read_sql(anomaly_query, engine)

# Pivot the dataframe to get a column for each anomaly type
pivot_df = anomaly_df.pivot(index='Month', columns='Anomaly', values='Anomaly_Count').fillna(0)

# Reset index to make 'Month' a column again
pivot_df.reset_index(inplace=True)

# Convert the pivoted dataframe into a format suitable for Plotly
plotly_data = pivot_df.melt(id_vars=['Month'], var_name='Anomaly', value_name='Anomaly_Count')

# Ensure the 'Month' column is in datetime format for proper plotting
plotly_data['Month'] = pd.to_datetime(plotly_data['Month'])

# Create the line plot using Plotly
fig = px.line(
    plotly_data,
    x='Month',
    y='Anomaly_Count',
    color='Anomaly',
    title='Anomaly Counts Over Time',
    labels={'Anomaly_Count': 'Anomaly Count', 'Month': 'Month'}
)

# Update layout for better readability
fig.update_layout(
    xaxis_title='Month',
    yaxis_title='Anomaly Count',
    legend_title='Anomaly Types',
    xaxis_tickangle=-45
)

# Show the plot
fig.show()

# Close the database connection
engine.dispose()
2.3. Visual graph for anomaly detection.
3. Here are the steps demonstrating how SingleStore works in the preceding Python program to detect anomalies for the different frauds
3.1. Importing the required packages and defining the OpenAI API key.
3.2. Creating the database connection and querying the table for the data on which anomalies will be marked.
3.3. Defining a function to generate embeddings using the OpenAI text-embedding-ada-002 model.
3.4. Applying the function “generate_embeddings” to generate embeddings for each row. In this section I also run the stored procedure “detectanomaly” to mark an anomaly on each record if the row satisfies the conditions described in the preceding Explanation section.
How to mark anomalies for CDR
First, generate a base embedding for each condition mentioned in the Explanation section. These base embeddings are used to mark anomalies on every record whose embedding falls close to them, measured with the SingleStore DOT_PRODUCT vector function. In my case, I considered the Duration, Cost and Call_Type columns to generate the base embeddings; you can also include other factors.
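As a rough sketch, the base embeddings could be produced the same way as the per-record embeddings, by describing a representative example of each fraud pattern with the same Duration, Cost and Call_Type fields. The example values below are assumptions for illustration, not the exact vectors used in the stored procedure.

# Sketch: one representative "base" description per fraud pattern (values are assumed).
import openai

openai.api_key = "<OPENAI_API_KEY>"

def embed(description: str) -> list:
    response = openai.embeddings.create(input=[description], model="text-embedding-ada-002")
    return response.data[0].embedding

base_embeddings = {
    "IRSF":          embed("Duration: 3600, Cost: 150.0, Call Type: International"),
    "Wangiri Fraud": embed("Duration: 10, Cost: 0.5, Call Type: International"),
    "PBX Hacking":   embed("Duration: 5400, Cost: 200.0, Call Type: International"),
}

# These are the vectors that the stored procedure compares each CDR embedding against
# with DOT_PRODUCT (they appear there as the long JSON_ARRAY_PACK literals).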
In the second step, get the embeddings of each CDR using the “generate_embeddings” function, the same function used in the Python program above to embed each record.
Finally, mark anomalies for each CDR that matches or lies near the base embeddings. For this I used the SingleStore stored procedure “detectanomaly”, which updates each CDR with the relevant anomaly: 'IRSF', 'Wangiri Fraud' or 'PBX Hacking'.
4. Criteria to mark anomalies
The stored procedure “detectanomaly” decides whether each record is anomalous by computing the SingleStore DOT_PRODUCT of its embedding against predefined vectors, each representing a specific type of fraud. If a certain threshold is met or exceeded, the record is flagged with the corresponding type of fraud.
DOT_PRODUCT() can be used to compute a cosine similarity metric of the two input vectors, if the input vectors are normalized to length 1.
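OpenAI's text-embedding-ada-002 vectors are already normalized to length 1, so DOT_PRODUCT behaves as cosine similarity here. If you switch to an embedding model that does not normalize its output, a small normalization step (sketched below) keeps the comparison valid.

# Normalize a vector to unit length so DOT_PRODUCT acts as cosine similarity.
import numpy as np

def normalize(vector):
    v = np.asarray(vector, dtype=np.float32)
    norm = np.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else v.tolist()

# A unit vector dotted with itself scores 1.0, the maximum similarity.
v = normalize([0.3, 0.4, 0.5])
print(round(float(np.dot(v, v)), 3))  # 1.0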
In the stored procedure I set the threshold to 0.981 to mark anomalies of a similar type based on the duration, cost and call_type criteria — but you can set your own threshold to mark anomalies based on specific requirements.
Interested in trying out AI-driven fraud detection? Get started with SingleStore today.