Is there a way to 100% automate SQL Server CDC initialization in an active SQL Server database? I am trying to solve the problem of finding the from_lsn for the first CDC data capture.
Sequence of events:
- Enable CDC on given database/Table
- Copy full table to destination (Data lake)
- Use CDC to capture first delta (I want to avoid duplicates, without missing a transaction)
Problem:
- How to get the from_lsn for the fn_cdc_get_all_changes_Schema_Table(from_lsn, to_lsn, '<row_filter_option>') function
Note:
- Need to automate 100%
- Cannot stop transactions on the table
- Cannot miss any data and cannot afford duplicate data
CodePudding user response:
Before doing the initial load, get the value of fn_cdc_get_max_lsn() and store it. This function returns the highest LSN known to CDC across all capture instances. It is the high-water mark for the whole database.
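In T-SQL, that first step might look like the sketch below. The table name and the cdc_sync_state bookkeeping table are assumptions for illustration; you would create and name your own.

```sql
-- Hypothetical bookkeeping table that persists the high-water mark per source table.
CREATE TABLE dbo.cdc_sync_state (
    table_name sysname PRIMARY KEY,
    last_lsn   binary(10) NOT NULL
);

-- Before the initial load: capture and store the current high-water mark.
DECLARE @start_lsn binary(10) = sys.fn_cdc_get_max_lsn();

INSERT dbo.cdc_sync_state (table_name, last_lsn)
VALUES (N'dbo.MyTable', @start_lsn);
```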
Copy the whole table.
Start your delta process. The first time you call the delta function, the value of the min_lsn argument will be the stored value previously retrieved from fn_cdc_get_max_lsn(), incremented by fn_cdc_increment_lsn. Get the current value from fn_cdc_get_max_lsn() (not the stored one) and use it as the value of the max_lsn argument.
From here, proceed as you expect. Take the maximum LSN returned from the delta function and store it. The next time you pull a delta, apply fn_cdc_increment_lsn to the stored value, use the result as the min_lsn argument, and use the result of fn_cdc_get_max_lsn() as the max_lsn argument.
With this process you will never miss any data. (Not covered here: be sure to check that your boundary conditions fall within a valid LSN range.)
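A sketch of one delta pull, assuming a capture instance named dbo_MyTable and the hypothetical cdc_sync_state bookkeeping table described above. For simplicity it stores the upper bound of the range rather than the maximum LSN returned by the delta function; the two are equivalent here because nothing beyond the upper bound was read.

```sql
DECLARE @stored_lsn binary(10),
        @from_lsn   binary(10),
        @to_lsn     binary(10);

-- Retrieve the high-water mark saved by the previous run.
SELECT @stored_lsn = last_lsn
FROM dbo.cdc_sync_state
WHERE table_name = N'dbo.MyTable';

-- Lower bound: one past the last LSN already processed.
SET @from_lsn = sys.fn_cdc_increment_lsn(@stored_lsn);
-- Upper bound: the current high-water mark.
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

-- Boundary check: the range must be non-empty and still inside
-- what CDC retains for this capture instance.
IF @from_lsn <= @to_lsn
   AND @from_lsn >= sys.fn_cdc_get_min_lsn(N'dbo_MyTable')
BEGIN
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');

    -- After the delta is safely loaded, advance the high-water mark.
    UPDATE dbo.cdc_sync_state
    SET last_lsn = @to_lsn
    WHERE table_name = N'dbo.MyTable';
END;
```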
Now, you mentioned that you want to avoid "duplicates". But if you try to define what a "duplicate" is in this scenario, I think you'll find it difficult.
For example, suppose I have this table to begin with:
create table t(i int primary key, c char);
insert t(i, c) values (1, 'a');
- I call fn_cdc_get_max_lsn() and get 0x01.
- A user inserts a new row into the table: insert t(i, c) values (2, 'b');
- The user operation is associated with an LSN value of 0x02.
- I select all the rows in this table (getting two rows).
- I write both rows to my destination table.
- I start my delta process. My min_lsn argument will be 0x02.
I will therefore get the {2, 'b'} row in the delta. But I already retrieved the row {2, 'b'} as part of my initial load. Is this a "duplicate"? No, this represents a change to the table. What will I do with this delta when I load it into my destination? There are really only two options.
Option 1: I am going to merge the delta into the destination table based on the primary key. In that case, when I merge the delta I will overwrite the already-loaded row {2, 'b'} with the new row {2, 'b'}, the outcome of which looks the same as not doing anything.
Option 2: I am going to append all changes to the destination. In that case my destination table will contain the row {2, 'b'} twice. Is this a duplicate? No, because the two rows represent how the data looked at different logical times: first when I did the initial load, and then when I did the delta.
If you try to argue that this is in fact a duplicate, then I counter by giving you this hypothetical scenario:
- You do the initial load, receiving row {1, 'a'}.
- No users change any data.
- You get your first delta, which is empty.
- A user executes update T set c = 'b' where i = 1.
- You get your second delta, which will include the row {1, 'b'}.
- A user executes update T set c = 'a' where i = 1.
- You get your third delta, which will include the row {1, 'a'}.
Question: Is the row you retrieved during your third delta a "duplicate"? It has the same values as a row we already retrieved previously.
If your answer is "yes", then you can never eliminate "duplicate" reads, because a "duplicate" will occur any time a row mutates to have the same values it had at some previous point in time, which is something over which you have no control. If this is a "duplicate" that you need to eliminate in the append scenario, then that elimination must be performed at the destination, by comparing the incoming values with the existing values.