I created a function to check if there is repeated values in a dataframe based on a Seq of columns.
I want to implement an "ignoreNulls", to be passed as a Boolean parameter into the function
- If true, will ignore and not group and count the nulls values. So for the nulls values, the "newColName" will return false.
- If false (default), will consider nulls values as a group and return true if theres multiples values with nulls for the key that I'm checking.
I don't know how could I do this.
Should I use an if
or case
?
There's some expression to ignore nulls on partitionBy statement?
Anyone could help me?
Here's the current function
def checkRepeatedKey(newColName: String, keys: Seq[String])(dataframe: DataFrame): DataFrame = {
val repeatedCondition = $"sum" > 1
val windowCondition = Window.partitionBy(keys.head, keysToCheck.tail: _*)
dataframe
.withColumn("count", lit(1))
.withColumn("sum", sum("count").over(windowCondition))
.withColumn(newColName, repeatedCondition)
.drop("count", "sum")
}
Some test data
val testDF = Seq(
("1", Some("name-1")),
("2", Some("repeated-name")),
("3", Some("repeated-name")),
("4", Some("name-4")),
("5", None),
("6", None)
).toDF("name_key", "name")
Testing the function
val results = testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"))
Output (without the ignoreNulls implementation)
-------- --------------- --------------------
|name_key| name | has_repeated_name |
-------- --------------- --------------------
| 1 | name-1 | false |
-------- --------------- --------------------
| 2 | repeated-name | true |
-------- --------------- --------------------
| 3 | repeated-name | true |
-------- --------------- --------------------
| 4 | name-4 | false |
-------- --------------- --------------------
| 5 | null | true |
-------- --------------- --------------------
| 6 | null | true |
-------- --------------- --------------------
And with the ignoreNulls=true implementation should be like this
-- function header with ignoreNulls parameter
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame =
-- using the function, passing true for ignoreNulls
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), true)
-- expected output for nulls
-------- --------------- --------------------
| 5 | null | false |
-------- --------------- --------------------
| 6 | null | false |
-------- --------------- --------------------
CodePudding user response:
Firstly, you should define properly the logic in case that only part of the columns in keys
are null - should it be counted as null values or null value is defined only if all the columns in keys
are null?
For the sake of simplicity, lets assume that the there is only one column in keys
(you can easily extend the logic for multiple columns). You can just add a simple if
into your checkRepeatedKey
function:
def checkIfNullValue(keys: Seq[String]): Column = {
// for the sake of simplicity checking only the first key
col(keys.head).isNull
}
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = {
...
...
val df = dataframe
.withColumn("count", lit(1))
.withColumn("sum", sum("count").over(windowCondition))
.withColumn(newColName, repeatedCondition)
.drop("count", "sum")
if (ignoreNulls)
df.withColumn(newColName, when(checkIfNullValue(keys), df(newColName)).otherwise(lit(false))
else df
}