Home > Software engineering >  What is the industry standard Deduping method in Dataflows?
What is the industry standard Deduping method in Dataflows?

Time:11-20

So Deduping is one of the basic and imp Datacleaning technique.

There are a number of ways to do that in dataflow.

Like myself doing deduping with help of aggregate transformation where i put key columns(Consider "Firstname" and "LastName" as cols) which are need to be unique in Group by and a column pattern like name != 'Firstname' && name!='LastName' $$ _____first($$) in aggregate tab.

The problem with this method is ,if we have a total of 200 cols among 300 cols to be considered as Unique cols, Its a very tedious to do include 200 cols in my column Pattern.

Can anyone suggest a better and optimised Deduping process in Dataflow acc to the above situation?

CodePudding user response:

I tried to repro the deduplication process using dataflow. Below is the approach.

  • List of columns that needs to be grouped by are given in dataflow parameters. In this repro, three columns are given. This can be extended as per requirements.
Parameter Name: Par1
Type: String
Default value: 'col1,col2,col3'

enter image description here

  • Source is taken as in below image. (Group By columns: col1, col2, col3; Aggregate column: col4)

enter image description here

  • Then Aggregate transform is taken and in group by, sha2(256,byNames(split($Par1,','))) is given in columns and it is named as groupbycolumn enter image description here

  • In Aggregates, Add column pattern near column1 and then delete Column1. Then Enter true() in matching condition. Then click on undefined column expression and enter $$ in column name expression and first($$) in value expression. gif31

Output of aggregation function

enter image description here

Data is grouped by col1,col2 and col3 and first value of col4 is taken for every col1,col2 and col3 combination.

  • Then using select transformation, groupbycolumn from above output can be removed before copying to sink.

Reference: ** MS document** on Mapping data flow script - Azure Data Factory | Microsoft Learn

  • Related