pySpark : add value to an empty array without 'null'

Time:02-19

I'm building a validation job for a dataset, and I intend to gather the errors in a single array-type column. So I create the column first, then run each check, and when one fails I append an error message to that dedicated column.

The problem appears when I add the first value: the array still contains a null, and I can neither avoid it nor remove it afterwards.

Here are some pieces of code I tried.
First version, the basic approach I would expect:

df = df.withColumn("errors", F.array(F.lit(None)))

df = df.withColumn(
    "errors",
    F.when(
        ((F.col("col1").isNull()) | ~(F.col("col1").rlike(r"\b[A-Z]{3,3}"))),
        F.array_union(F.col("errors"), F.array(F.lit("col1 is not valid"))),
    ),
)

The result when there is an error: [null, col1 is not valid]

I also tried checking whether the array already contains values:

df = df.withColumn("errors", F.array(F.lit(None)))

df = df.withColumn(
    "errors",
    F.when(
        ((F.col("col1").isNull()) | ~(F.col("col1").rlike(r"\b[A-Z]{3,3}"))),
        F.when(
            F.size(F.col("errors")) == -1, F.array(F.lit("col1 is not valid"))
        ).otherwise(
            F.array_union(F.col("errors"), F.array(F.lit("col1 is not valid")))
        ),
    ),
)

but the result is exactly the same.

Any idea how to handle this? Should I take a different approach?

CodePudding user response:

Try with this instead:

df = df.withColumn("errors", F.lit(None).cast("array<string>"))

df = df.withColumn(
    "errors",
    F.when(
        F.col('col1').isNull() | ~F.col('col1').rlike(r'\b[A-Z]{3,3}'),
        F.coalesce(
            F.concat(F.col("errors"), F.array(F.lit("col1 is not valid"))),
            F.array(F.lit("col1 is not valid"))
        )
    )
)

The errors column is first created as a true null cast to array&lt;string&gt;, rather than an array containing null. Since concat returns null when any of its arguments is null, the coalesce falls back to a fresh single-element array for the first error; on subsequent errors, concat simply appends.


Or, if you prefer, you can remove the null value from the errors column at the end, like this:

.withColumn(
    "errors",
    F.array_except("errors", F.array(F.lit(None)))
)

# or using filter
.withColumn(
    "errors",
    F.expr("filter(errors, x -> x is not null)")
)