Hello, here is a response from SQrL
SingleStore Pipelines are designed for real-time data ingestion [1]. Once a pipeline has picked up a file, it does not look for further changes to that file or track deleted rows [1]; pipelines treat files as append-only [1].
However, you can use the `ALTER PIPELINE ... DROP FILE` command to handle files that have been replaced or removed [2]. When you download new versions of your files, run `ALTER PIPELINE ... DROP FILE` for each existing file before placing the new versions in the `/var/lib/memsql/db_files` folder. That makes the pipeline forget all metadata associated with those files [2], so when the new files appear in the directory, the pipeline treats them as new files to be loaded [2].
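For example, a refresh script might issue one `DROP FILE` statement per replaced file (the pipeline and file names here are hypothetical):

```sql
-- Forget the pipeline's metadata for the old file so its replacement
-- is treated as brand new (pipeline and file names are hypothetical).
ALTER PIPELINE products_pipeline DROP FILE 'products.csv';

-- After copying the new products.csv into /var/lib/memsql/db_files,
-- the running pipeline will pick it up as an unseen file.
```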
Regarding your database with potentially missing rows in new data sets, there are a few ways to handle this. One method uses SingleStore’s support for Slowly Changing Dimensions [3]. With this approach, changes to data (additions, updates, and deletions) are tracked as new rows in the database [3]. Applied to your situation, you could include a “version” or “datetime” field in your records; when new files are loaded, older records that aren’t included in the new file can then be marked as expired based on their version or datetime [3], as in the sketch below.
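As a minimal sketch of that expiry step (the table and columns here are hypothetical, not from the docs):

```sql
-- Hypothetical dimension table: one row per record version.
CREATE TABLE product_dim (
    product_id BIGINT NOT NULL,
    name VARCHAR(255),
    price DECIMAL(10,2),
    load_time DATETIME NOT NULL,          -- when this version was loaded
    expired BOOLEAN NOT NULL DEFAULT FALSE
);

-- After a daily load, expire every row the newest file did not
-- re-deliver: older versions are either superseded or deleted upstream.
UPDATE product_dim
SET expired = TRUE
WHERE load_time < '2024-06-01 00:00:00';  -- timestamp of the latest load
```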
Here’s an example where a pipeline is used to track changes to product information [3]:
```sql
CREATE PIPELINE changing_dimension_pipeline
AS LOAD DATA FS '/product_history.txt'
INTO PROCEDURE changing_dimension_proc
FIELDS TERMINATED BY ',';

START PIPELINE changing_dimension_pipeline;
```
This pipeline loads data into a stored procedure, `changing_dimension_proc` [3]. This procedure could contain logic for handling changes to existing records [3].
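The source doesn’t show the body of `changing_dimension_proc`, but a minimal version, assuming the hypothetical `product_dim` table above, might simply insert each incoming row as a new version:

```sql
CREATE OR REPLACE PROCEDURE changing_dimension_proc(
    batch QUERY(product_id BIGINT, name VARCHAR(255), price DECIMAL(10,2)))
AS
BEGIN
    -- Stamp every incoming row with the load time; older versions
    -- are then expired by a follow-up UPDATE like the one above.
    INSERT INTO product_dim (product_id, name, price, load_time)
    SELECT product_id, name, price, NOW()
    FROM batch;
END;
```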
For more customized operations, you can also use a stored procedure to apply more complex logic to the incoming rows. If you need to ignore certain records (such as header or trailer records), you can use a transform, as suggested in this forum post [4]. Lastly, column changes in the files can be handled using a JSON pipeline strategy [5].
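On the SQL side, a transform is attached in the pipeline definition. The sketch below assumes a hypothetical local script `skip_header.py` that copies stdin to stdout while dropping header and trailer lines; the exact URI and arguments depend on how you deploy the transform:

```sql
-- Hypothetical pipeline that filters out header/trailer records
-- via a transform script before loading.
CREATE PIPELINE filtered_pipeline
AS LOAD DATA FS '/product_history.txt'
WITH TRANSFORM ('file://localhost/skip_header.py', '', '')
INTO PROCEDURE changing_dimension_proc
FIELDS TERMINATED BY ',';
```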
Please keep in mind that handling changing file formats and maintaining a real-time reflection of your source data in the database, particularly with deletions, can be complex. An alternative approach is to take a complete snapshot of your data with each daily load and replace the old data wholesale.
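One common way to implement that full-replace pattern, sketched here with hypothetical names, is to load each day’s snapshot into a fresh table and then repoint a view, so readers always see one complete, consistent data set:

```sql
-- Load today's snapshot into a brand-new table.
CREATE TABLE products_20240601 LIKE products_20240531;
-- ... run the daily pipeline into products_20240601 ...

-- Repoint readers at the new snapshot, then drop the old one.
CREATE OR REPLACE VIEW products_current AS
SELECT * FROM products_20240601;
DROP TABLE products_20240531;
```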