CDC-out & Observe Feedback/Clarification

tom · July 2, 2024, 8:17am

Hi,
i am currently implementing a Bytewax Connector for SingleStore (GitHub - tomkuehl/singlestore-bytewax-connector). I am facing two challenges:

Delete events only include an internal id. The docs states, that the internal id is “An internal identifier for tables that do not have primary keys, which allows identification of records in the absence of a primary key.” (Change Data Capture · SingleStore Documentation). I guess this doesn’t work for delete events? Is the recommended approach to keep the internal id as an “absolute identifier” in downstream systems? Are there any plans to include the primary key in the payload of delete events? I would prefer the latter, but totally get it if it’s a technical obstacle.
Currently there is no way to check the “total offset” of a table, is it? I want to implement some kind of reporting mechanism to keep track of the stream lag. Are there any plans to integrate this into OBSERVE? Only idea i have as of now is to kick off two observe queries: the first one will just consume the stream to get to the latest event as fast as possible, the second one will actually be the processed data stream. This way i could just do: first_event_count - second_last_event = lag. But it sounds pretty stupid to do it this way Happy for any insights or maybe things i am missing out.

Thanks & best,
Tom

tom · July 2, 2024, 8:47pm

Is it safe to use MV_COLLECTED_ROW_CHANGE_COUNTS or MV_ROW_CHANGE_COUNTS to implement the offset lag?

Does “This view is primarily for internal use” translates into “This view may break in future patch releases without notice”?

sderoeck · July 3, 2024, 9:43pm

Hi Tom, thank for working on the Bytewax integration !

The intent is for the InternalId to be a fallback unique identifier in case the table schema in question is missing a primary key. In case the table does have (one or more) primary key column(s), the respective columns should be populated for DELETE records. Now there’s currently one major caveat in that it depends on the table type. For Rowstore tables, the primary keys are populated as intended. Columnstore tables currently don’t (always) serialize the primary/unique keys to disk, meaning we don’t (always) have access to them. We’re currently working on always exposing those keys if they exist. Right now we simply always fall back to the InternalId for Columnstore tables.

For the scenario where only the InternalId is available, you’ll either have to track it on the remote system or somewhere external in order to match which record was deleted.

We’re working on exposing a few key Offsets that might be of interest to consumers, e.g. TAIL of the log, HEAD of the log & the start of available snapshots. In the interest of measuring e.g. backpressure or delay, you can compute the logical offset of the current TAIL vs your last seen Offset for each partition.
However, (there’s always another *), the Offset delta has very little correlation to how much data is still to be processed. E.g. an Offset difference of 1000 could imply anything from 1 to a few million records were modified. So logical Offset distance might not be the best metric for backlog size. However to monitor latency and/or stalling CDC streams this should work fine.

I’ll make a note to update this post once we’ve added both changes.

tom · July 4, 2024, 9:00am

Thank you!

Perfect, good to know! Is there an ETA for exposing the keys for columnstore tables?
Is it safe to use mv_collected_row_change_counts? Currently I implemented the stream lag by counting the events (Inserts, Updates and Deletes) and subtract the row changes returned from the table. It seems to work pretty well. The table is not updated in real-time, but thats totally fine IMO. I think this is better to track the backlog size, than what you described. Or am I missing something?

EllieWright · July 22, 2024, 5:11am

I am replying here for the updates.