We have all heard about structured, semi-structured and unstructured data, while referring to data in a variety of formats. And sometimes, you might wonder how they are distinct from each other — and how a database processes and stores them. In this blog, we’ll take a deep dive into these data formats.
Structured data
Structured data refers to information that is organized in a format that is easily understandable by both humans and machines. In structured data, the elements or fields are clearly defined, and there is a well-defined schema or model that governs the relationships and properties of these elements. This organization allows for efficient storage, retrieval and analysis of data.
Structured data is typically found in relational databases, spreadsheets and other tabular formats. Each piece of data is assigned to a specific category or field, and relationships between different pieces of data are explicitly defined.
Examples of structured data include tables with rows and columns, where each column represents a specific attribute or property, and each row corresponds to a unique record or entry. Structured data is commonly used in various applications including business databases, financial systems and other information management systems where organization and consistency are essential.
Semi-structured data
This is where things get in the gray area. As the name suggests, this type of data has some structure — but leaves a lot of area open for some non-structured data, as well.
XML, JSON, Avro, Parquet are all categorized as semi-structured data. The data from sensors and server logs can easily be categorized in this bucket, since it generally appears in or can easily be converted to JSON/CSV-like formats.
Some data vendors even categorize HTML code and emails as semi-structured data, while others put them in unstructured buckets. Think of email as a JSON object with fields like sender, recipients, subject, time, etc., and it is semi-structured. But if you think of it with some attachments like media and PDF files, it becomes unstructured.
Here’s a more formal definition of semi-structured data:
Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.
In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important.
SingleStore has native support for storing and processing JSON data. SingleStore Pipelines support ingesting the data from Parquet, CSV and Avro formats and SELECT INTO. Additionally, an S3 statement allows dumping the data in Parquet form as well.
Unstructured data
Text files, PDFs, images, video and audio files, etc. can all be categorized as unstructured data.
This data can be stored inside the table as a Binary Large Object (BLOB), but is more suitable for storing in file systems or object storages due to their potential size — especially due to their unique requirements of backup and compliance policies.
However, the metadata and vector embeddings associated with this kind of data still needs to be stored inside the database to increase the discoverability and usability of this data.
Metadata generally represents the file name, file uri, size, type, owner, creation date, etc., within the file system/object storage. It may also constitute deeper details like extracted text, object boundaries, etc. As you can imagine, this data can be easily represented in a relational table-like structured format or in semi-structured form — like a JSON column — or a combination of both.
The user can enter this data manually or engage an external machine learning model to generate it for the unstructured data. SingleStore also allows you to use external ML models and libraries to create, store and process vector embeddings for your unstructured data. This helps in creating real-time AI applications.
Imagine this scenario: The senior executive of a company wants to get all PDF documents where a particular feature has been mentioned; a journalist in a media channel wants to get all videos, images and news articles where a specific politician appeared; an accountant wants to get the summary of a company's latest annual report. The applications are endless, but it boils down to the same pieces: each is an example of finding unstructured data.
Processing structured data in SingleStore
Let’s take the example of customer information with a set of well defined fields. We can organize this in a table with rows and columns. We can then use simple SQL queries to perform operations on this data.
DROP DATABASE IF EXISTS structured;
CREATE DATABASE structured;
USE structured;
CREATE TABLE customer(
c_custkey int(11) NOT NULL,
c_name varchar(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
c_address varchar(40) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
c_nationkey int(11) NOT NULL,
c_phone char(15) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
c_acctbal decimal(15,2) NOT NULL,
c_mktsegment char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
c_comment varchar(117) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY pk ( c_custkey ) UNENFORCED RELY,
SHARD KEY ( c_custkey ) USING CLUSTERED COLUMNSTORE
);
SELECT * FROM customer LIMIT 5;
We can also run complex aggregates (like SUM) of the balance of all customers in a particular market segment.
Processing semi-structured data in SingleStore
Let’s take a look at an example around processing JSON data. For this scenario, we are using the Nobel Prize dataset and storing it in a JSON column inside SingleStore.
CREATE DATABASE semistructured;
USE semistructured;
CREATE TABLE nobelprize (
id INT AUTO_INCREMENT PRIMARY KEY,
description JSON NOT NULL);
---
import requests
import json
import singlestoredb as s2
response_API = requests.get('https://api.nobelprize.org/v1/prize.json')
data = response_API.text
parse_json = json.loads(data)
prizes = parse_json['prizes']
data_to_insert = [(json.dumps(x),) for x in prizes]
conn = s2.connect(connection_url)
stmt = 'INSERT INTO nobelprize (description) VALUES (%s)'
with conn:
conn.autocommit(True)
with conn.cursor() as cur:
cur.executemany(stmt, data_to_insert)
We then query and aggregate the JSON fields and array elements within this JSON using SQL — the same can be applied to more complex JSON datasets and queries.
SingleStore Cloud also supports working with semi-structured data using SingleStore Kai™, our API for applications built on MongoDB®.
Processing unstructured data in SingleStore
Here are the high-level steps for storing and processing unstructured data in SingleStore:
1. Store the unstructured data in a file system/object storage
2. Generate the metadata and vector embeddings of the data and store that in SingleStore
Metadata information may be supplied by the application itself, or can be extracted using an ML model. Vector embeddings can be generated using ML models trained for specific file types like Facenet for images, different ASR models for audio or an LLM like GPT-4/LLaMA for text. These may also be ML models trained for domain-specific tasks like analysis of medical records.
Moreover, for data with multiple formats like text, image, audio and video, models like multi2vec-clip can be used to generate consistent embedding and allow searching through different media.
In SingleStore, external functions can also be used to call the model endpoints. Metadata can be stored simply in columns, or with additional JSON columns for more complex structures. Vector embeddings can be stored in a column within the same table.
3. Your application queries SingleStore to get the list of files that match a specific query, either sending that list to the user or to an ML model for further analysis such as summarization.
A single query can be run against a combination of application data, metadata and vector embeddings leveraging JOINS in SingleStore.
Our customers have built powerful applications like video security, anti-human trafficking tools and numerous Retrieval Augmented Generation (RAG) applications using this capability of SingleStore.
Some examples are described in these blogs and webinars:
- Image Matching in SingleStore
- Query PDFs using SingleStore and LangChain
- How to build LLMs that hear and speak
- Working with Vector Data
- Build a Q+A RAG System with SingleStore
Interested in working with structured, semi-structured and unstructured data yourself? Activate your free SingleStore trial today.