Hi everyone,
I am currently running SingleStore Self-Managed v8.1 with bottomless (unlimited) storage on S3 buckets. I have encountered an issue where the data volume (actual data, blob cache, etc.) becomes abnormally inflated when SingleStore pushes data to the S3 bucket after processing or ingesting data via pipelines.
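For context, the database is attached to unlimited (bottomless) storage in the usual way. A rough sketch of the setup, where the database name, bucket path, region, and credentials are all placeholders:
-- Hypothetical setup; names, region, and credentials are placeholders.
CREATE DATABASE my_db
    ON S3 "my-bucket/my_db"
    CONFIG '{"region": "us-east-1"}'
    CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}';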
On the user side, they report ingesting about 400–500 GB of actual data and creating roughly 4–10 new tables. Some of their queries do read an extremely high number of rows (around 3 billion records per day, scanning 30 to 60 days of data at a time), even though they use only simple joins. Yet the S3 bucket associated with the database/schema grew to 10 TB of stored data, and then jumped to 30 TB within just 1 to 2 hours (these figures are usable data; the raw data is roughly 1.5 times larger).
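To cross-check the claimed ingest volume against what the pipelines actually loaded, something like the following should work; this is a sketch assuming information_schema.PIPELINES_FILES exposes file_size and file_state columns on this version:
-- Sum bytes loaded per pipeline, split by file state
-- (assumes file_size/file_state columns are available on v8.1).
SELECT
    pipeline_name,
    file_state,
    COUNT(*) AS numFiles,
    SUM(file_size) / (1024 * 1024 * 1024) AS loadedGB
FROM
    information_schema.PIPELINES_FILES
WHERE
    database_name = '<database>'
GROUP BY
    pipeline_name,
    file_state;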
Previously, we hit a similar scenario in which a user imported only about 5 GB of data, yet the corresponding S3 bucket swelled to 10 TB. At that time we immediately stopped the pipelines, halted table creation, and deleted some of the imported data in hopes of mitigating the issue. Everything eventually stabilized after about 3 days.
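For reference, the emergency mitigation amounted to roughly the following, with a hypothetical pipeline name:
-- Hypothetical mitigation; my_pipeline is a placeholder name.
SHOW PIPELINES;             -- identify the running pipelines
STOP PIPELINE my_pipeline;  -- halt further ingestion
-- Table creation was paused manually, and part of the imported data was deleted.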
-- Count primary blob segments and their total size for the affected database.
SELECT
    COUNT(*) AS numSegments,
    SUM(size) / (1024 * 1024 * 1024) AS blobdataSizeGB
FROM
    information_schema.mv_cached_blobs
WHERE
    database_name = '<database>'
    AND type = 'primary';
When the S3 bucket hit 10 TB, we ran the mv_cached_blobs query above to check some statistics and found nearly 10 million segments, with a blob data size of approximately 30 TB.
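Nearly 10 million segments suggests a large number of small segments that the background merger has not yet consolidated, and superseded blobs may linger in object storage until garbage collection catches up. Two follow-ups we are considering, sketched below with a placeholder table name: breaking the cached blobs down by type, and manually triggering the merger on the largest tables:
-- Break cached blob size down by blob type
-- (same mv_cached_blobs columns as the query above).
SELECT
    type,
    COUNT(*) AS numBlobs,
    SUM(size) / (1024 * 1024 * 1024) AS sizeGB
FROM
    information_schema.mv_cached_blobs
WHERE
    database_name = '<database>'
GROUP BY
    type;

-- Manually trigger the columnstore merger on a suspect table
-- (my_table is a placeholder).
OPTIMIZE TABLE my_table;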
Is this unexpected behavior when SingleStore interacts with bottomless storage? I have also used SingleStore versions 8.7 and 8.9 in dev and have never encountered this issue (though perhaps my data volume or workload wasn't as intensive). Does this behavior tend to occur more frequently in earlier versions of SingleStore, or might it also appear in newer versions?
I would appreciate any insights or additional information regarding this case.
Thanks to all