Home > Net >  Creating an array of structs from a series of when
Creating an array of structs from a series of when

Time:08-24

Pretty junior data person here, so apologies in advance if this is a dumb one/I'm thinking about it in the wrong way. Working with Scala in Databricks.

My objective is to evaluate a dataframe, and return some data quality checks for a shift scheduling system.

Today I'm doing this:

employees
.withColumn(
      "record_validity", 
  split(         
    concat_ws(
      ",",
        when($"employmentPeriod"<0, "corrupt:date:negative-period"),
        when($"deactivationDate".isNull && $"hiredDate".isNull, "corrupt:date:missing"),
        when($"synthetic_hiredDate".isNull, "corrupt:date:deactivated-after-creation"),
        when($"hiredDate".isNull, "corrupt:date:missing-hire-date"),
        when($"deactivationDate".isNull && $"employment"==="active", "valid:implicit")
                    ),
        ","
      )
    )

Which returns:

["corrupt:date:missing", "corrupt:date:deactivated-after-creation", "corrupt:date:missing-hire-date", "valid:primary-department:missing"]

However, I'd like to instead return an array of structs as these checks are going to become more complicated & the eng team would prefer to get it in this format:

[{ validationCategory: "corrupt", validationError: "date-negative-period" }
 { validationCategory: "corrupt", validationError: "date-missing" } 
 {...etc...}]

I tried modifying my when to return a struct instead of a string, but it can't (if my other SO reading is correct). So I'm a bit stumped as to how to continue - any hints appreciated.

CodePudding user response:

How about the idea of maintaining logic separately in tuples of the kind (category, error, condition).

You can then manipulate this list of tuples to create structure you require:

val validations = 
  List(
    ("corrupt", "date-negative-period", 
      $"employmentPeriod"<0),
    ("corrupt", "date-missing", 
      $"deactivationDate".isNull && $"hiredDate".isNull),
    ("corrupt", "date-deactivated-after-creation", 
      $"synthetic_hiredDate".isNull),
    ("corrupt", "date-missing-hire-date", 
      $"hiredDate".isNull),
    ("valid", "implicit", 
      $"deactivationDate".isNull && $"employment"==="active")
  )

def validation_list = 
  validations.map{
    case (category, error, condition) => 
      when(
        condition,
        struct(
          lit(category).as("validationCategory"),
          lit(error).as("validationError")
        )
      )
  }

Then from this list of validation columns you will build your array column, possibly filtering out the null results (i.e. null - validation check that passed):

val df = employees.withColumn("record_validity", filter(array(validation_list:_*), x => x.isNotNull))
df.printSchema
root
 ...
 |-- record_validity: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- validationCategory: string (nullable = false)
 |    |    |-- validationError: string (nullable = false)
  • Related