I'm building a repository to test a list of data and I intend to gather errors in a single column of array type. Therefore, I create the column first, then perform each test, and if one fails, I add an error message in the dedicated column.
My concern is when I add the first value: I still have a null
value and I cannot manage to avoid this nor to remove it.
Here is some pieces of code I tried.
First version: the basic way I would expect:
df = df.withColumn("errors", F.array(F.lit(None)))
df = df.withColumn(
"errors",
F.when(
((F.col("col1").isNull()) | ~(F.col("col1").rlike(r"\b[A-Z]{3,3}"))),
F.array_union(F.col("errors"), F.array(F.lit("col1 is not valid"))),
),
)
The result when there is an error: [null, col1 is not valid]
I tried to check if there already are some values in the array:
df = df.withColumn("errors", F.array(F.lit(None)))
df = df.withColumn(
"errors",
F.when(
((F.col("col1").isNull()) | ~(F.col("col1").rlike(r"\b[A-Z]{3,3}"))),
F.when(
F.size(F.col("errors")) == -1, F.array(F.lit("col1 is not valid"))
).otherwise(
F.array_union(F.col("errors"), F.array(F.lit("col1 is not valid")))
),
),
)
but the result is exactly the same.
Any idea to manage this? Should I proceed a different way?
CodePudding user response:
Try with this instead:
df = df.withColumn("errors", F.lit(None).cast("array<string>"))
df = df.withColumn(
"errors",
F.when(
F.col('col1').isNull() | ~F.col('col1').rlike(r'\b[A-Z]{3,3}'),
F.coalesce(
F.concat(F.col("errors"), F.array(F.lit("col1 is not valid"))),
F.array(F.lit("col1 is not valid"))
)
)
)
errors
column is first created with null value, then use coalesce function
Or if you prefer, you can remove the null
value from errors
column at the end like this:
.withColumn(
"errors",
F.array_except("errors", F.array(F.lit(None)))
)
# or using filter
.withColumn(
"errors",
F.expr("filter(errors, x -> x is not null)")
)