Dataflow streaming template for data masking/tokenization giving inconsistent results


The Google-provided Dataflow streaming template for data masking/tokenization from Cloud Storage to BigQuery using Cloud DLP is giving inconsistent output for each source file.

We have 100-odd files with 1M records each in a GCS bucket, and we are calling the Dataflow streaming template to tokenize the data using DLP and load it into BigQuery.

While loading the files sequentially, we saw that the results were inconsistent.

For a few files the full 1M rows were loaded, but for most of them the loaded row count varied between 0.98M and 0.99M. Is there any reason for such behaviour?

CodePudding user response:

I am not sure, but it may be due to the BigQuery best-effort deduplication mechanism used when streaming data into BigQuery:

From the Beam documentation:

Note: Streaming inserts by default enables BigQuery best-effort deduplication mechanism. You can disable that by setting ignoreInsertIds. The quota limitations are different when deduplication is enabled vs. disabled:

Streaming inserts applies a default sharding for each table destination. You can use withAutoSharding (starting 2.28.0 release) to enable dynamic sharding and the number of shards may be determined and changed at runtime. The sharding behavior depends on the runners.
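
As a rough illustration (not the Google-provided template itself), this is how a streaming BigQueryIO write with dynamic sharding can look in a custom Beam Java pipeline. The class name, destination table, and the GenerateSequence placeholder input are all assumptions for the sake of a self-contained sketch; in the real template the rows come from GCS files de-identified by Cloud DLP:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;
    import org.joda.time.Duration;

    public class AutoShardingStreamingWriteSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder unbounded input; in the real template the rows come from
        // GCS files that have been de-identified by Cloud DLP.
        PCollection<TableRow> rows =
            p.apply(GenerateSequence.from(0).withRate(10, Duration.standardSeconds(1)))
             .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
                 .via((Long i) -> new TableRow().set("id", i)))
             .setCoder(TableRowJsonCoder.of());

        rows.apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")  // placeholder destination
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                // Beam 2.28.0+: let the runner determine and adjust the number
                // of streaming shards at runtime instead of the default fixed
                // sharding per table destination.
                .withAutoSharding()
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

        p.run();
      }
    }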

From the Google Cloud documentation:

Best effort de-duplication: When you supply insertId for an inserted row, BigQuery uses this ID to support best effort de-duplication for up to one minute. That is, if you stream the same row with the same insertId more than once within that time period into the same table, BigQuery might de-duplicate the multiple occurrences of that row, retaining only one of those occurrences.

The system expects that rows provided with identical insertIds are also identical. If two rows have identical insertIds, it is nondeterministic which row BigQuery preserves.

De-duplication is generally meant for retry scenarios in a distributed system where there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.

De-duplication offered by BigQuery is best effort, and it should not be relied upon as a mechanism to guarantee the absence of duplicates in your data. Additionally, BigQuery might degrade the quality of best effort de-duplication at any time in order to guarantee higher reliability and availability for your data.

If you have strict de-duplication requirements for your data, Google Cloud Datastore is an alternative service that supports transactions.
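
To make the insertId behaviour concrete, here is a minimal sketch using the BigQuery Java client library directly (outside of Beam, so not what the template does internally); the dataset, table, row contents, and the "row-42" insertId are placeholders:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.Map;

    public class StreamingInsertWithInsertId {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table");  // placeholder

        Map<String, Object> content = Map.of("id", 42, "name", "example");

        // Rows streamed more than once within ~1 minute with the same insertId
        // ("row-42" here) may be collapsed into a single row by BigQuery's
        // best-effort de-duplication.
        InsertAllRequest request =
            InsertAllRequest.newBuilder(tableId)
                .addRow("row-42", content)
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
          response.getInsertErrors().forEach(
              (index, errors) -> System.err.println("Row " + index + " failed: " + errors));
        }
      }
    }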

This mechanism can be disabled with ignoreInsertIds.

You can test by disabling this mechanism and checking whether all the rows are inserted.
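
As far as I know the Google-provided template does not expose this as a template parameter, but in a custom Beam pipeline (continuing the sketch above, with the same imports and placeholder table) it is a small change on the write:

    rows.apply("WriteToBigQueryNoDedup",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder destination
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            // Do not attach insertIds to streaming inserts, so BigQuery will
            // not attempt best-effort de-duplication. Note that different
            // streaming quota limits apply when de-duplication is disabled.
            .ignoreInsertIds()
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));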

CodePudding user response:

By adjusting the value of the batch size in the template, all files of 1M records each were loaded successfully.
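
For reference, a sketch of how a batch size can be passed when launching the template with gcloud. The job name, region, bucket paths, deidentify template name, and batchSize value are placeholders, and the template path (assumed here to be Stream_DLP_GCS_Text_to_BigQuery) and parameter names should be checked against the current template documentation:

    gcloud dataflow jobs run dlp-tokenization-job \
      --region=us-central1 \
      --gcs-location=gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
      --staging-location=gs://my-bucket/temp \
      --parameters=inputFilePattern=gs://my-bucket/input/*.csv,datasetName=my_dataset,dlpProjectId=my-project,deidentifyTemplateName=projects/my-project/deidentifyTemplates/my-deid-template,batchSize=100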
