Check repeated values in a dataframe and implement a ignoreNulls parameter-CodePudding

I created a function to check if there is repeated values in a dataframe based on a Seq of columns.

I want to implement an "ignoreNulls", to be passed as a Boolean parameter into the function

If true, will ignore and not group and count the nulls values. So for the nulls values, the "newColName" will return false.
If false (default), will consider nulls values as a group and return true if theres multiples values with nulls for the key that I'm checking.

I don't know how could I do this. Should I use an if or case? There's some expression to ignore nulls on partitionBy statement?

Anyone could help me?

Here's the current function

def checkRepeatedKey(newColName: String, keys: Seq[String])(dataframe: DataFrame): DataFrame = {
    val repeatedCondition = $"sum" > 1
    val windowCondition   = Window.partitionBy(keys.head, keysToCheck.tail: _*)

    dataframe
      .withColumn("count", lit(1))
      .withColumn("sum", sum("count").over(windowCondition))
      .withColumn(newColName, repeatedCondition)
      .drop("count", "sum")
  }

Some test data

  val testDF = Seq(
      ("1", Some("name-1")),
      ("2", Some("repeated-name")),
      ("3", Some("repeated-name")),
      ("4", Some("name-4")),
      ("5", None),
      ("6", None)
    ).toDF("name_key", "name")

Testing the function

val results = testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"))

Output (without the ignoreNulls implementation)

 -------- --------------- -------------------- 
|name_key|       name    |  has_repeated_name |
 -------- --------------- -------------------- 
|     1  |      name-1   |              false |
 -------- --------------- -------------------- 
|     2  | repeated-name |               true |
 -------- --------------- -------------------- 
|     3  | repeated-name |               true |
 -------- --------------- -------------------- 
|     4  |      name-4   |              false |
 -------- --------------- -------------------- 
|     5  |         null  |               true |
 -------- --------------- -------------------- 
|     6  |         null  |               true |
 -------- --------------- --------------------

And with the ignoreNulls=true implementation should be like this


-- function header with ignoreNulls parameter
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = 

-- using the function, passing true for ignoreNulls
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), true)

-- expected output for nulls
 -------- --------------- -------------------- 
|     5  |         null  |              false |
 -------- --------------- -------------------- 
|     6  |         null  |              false |
 -------- --------------- --------------------

CodePudding user response：

Firstly, you should define properly the logic in case that only part of the columns in keys are null - should it be counted as null values or null value is defined only if all the columns in keys are null?

For the sake of simplicity, lets assume that the there is only one column in keys (you can easily extend the logic for multiple columns). You can just add a simple if into your checkRepeatedKey function:

def checkIfNullValue(keys: Seq[String]): Column = {
// for the sake of simplicity checking only the first key 
    col(keys.head).isNull
}


def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = {
    ...
    ...

    val df = dataframe
      .withColumn("count", lit(1))
      .withColumn("sum", sum("count").over(windowCondition))
      .withColumn(newColName, repeatedCondition)
      .drop("count", "sum")
    
    if (ignoreNulls) 
        df.withColumn(newColName, when(checkIfNullValue(keys), df(newColName)).otherwise(lit(false)) 
    else df
  }