Pretty junior data person here, so apologies in advance if this is a dumb one/I'm thinking about it in the wrong way. Working with Scala in Databricks.
My objective is to evaluate a dataframe, and return some data quality checks for a shift scheduling system.
Today I'm doing this:
employees
.withColumn(
"record_validity",
split(
concat_ws(
",",
when($"employmentPeriod"<0, "corrupt:date:negative-period"),
when($"deactivationDate".isNull && $"hiredDate".isNull, "corrupt:date:missing"),
when($"synthetic_hiredDate".isNull, "corrupt:date:deactivated-after-creation"),
when($"hiredDate".isNull, "corrupt:date:missing-hire-date"),
when($"deactivationDate".isNull && $"employment"==="active", "valid:implicit")
),
","
)
)
Which returns:
["corrupt:date:missing", "corrupt:date:deactivated-after-creation", "corrupt:date:missing-hire-date", "valid:primary-department:missing"]
However, I'd like to instead return an array of structs as these checks are going to become more complicated & the eng team would prefer to get it in this format:
[{ validationCategory: "corrupt", validationError: "date-negative-period" }
{ validationCategory: "corrupt", validationError: "date-missing" }
{...etc...}]
I tried modifying my when
to return a struct instead of a string, but it can't (if my other SO reading is correct). So I'm a bit stumped as to how to continue - any hints appreciated.
CodePudding user response:
How about the idea of maintaining logic separately in tuples of the kind (category, error, condition)
.
You can then manipulate this list of tuples to create structure you require:
val validations =
List(
("corrupt", "date-negative-period",
$"employmentPeriod"<0),
("corrupt", "date-missing",
$"deactivationDate".isNull && $"hiredDate".isNull),
("corrupt", "date-deactivated-after-creation",
$"synthetic_hiredDate".isNull),
("corrupt", "date-missing-hire-date",
$"hiredDate".isNull),
("valid", "implicit",
$"deactivationDate".isNull && $"employment"==="active")
)
def validation_list =
validations.map{
case (category, error, condition) =>
when(
condition,
struct(
lit(category).as("validationCategory"),
lit(error).as("validationError")
)
)
}
Then from this list of validation columns you will build your array column, possibly filtering out the null results (i.e. null
- validation check that passed):
val df = employees.withColumn("record_validity", filter(array(validation_list:_*), x => x.isNotNull))
df.printSchema
root
...
|-- record_validity: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- validationCategory: string (nullable = false)
| | |-- validationError: string (nullable = false)