I am trying to create a DataFrame to write to a BigQuery table. One column in the output table is a REQUIRED ID that I need to generate in my pipeline. I am generating it with a UDF, but no matter what I try, the column is created as nullable.
How I've created the UDF:
UserDefinedFunction genID = functions.udf(
        (UDF1<String, String>) this::generateEmailID, DataTypes.StringType);
The method the UDF calls:
private String generateEmailID(String srcId) {
    return UUID.nameUUIDFromBytes(("1_" + srcId).getBytes()).toString();
}
I then apply this to my temp view transformedData like this:
spark.sql("SELECT message_ID AS src_id FROM transformedData")
.withColumn(email_id, genID.apply(functions.col("src_id")))
This column needs to be REQUIRED to match the output table, and the column "src_id" is already nullable=false. So why does "email_id" get created with nullable=true, and how can I stop that from happening so I can write to the table? The schema I get is:
root
|-- email_id: string (nullable = true)
|-- src_id: string (nullable = false)
CodePudding user response:
That's probably how udf works. I would assume that Spark doesn't know what the UDF can return (it might return null for some input), so to be on the safe side, it marks the column as nullable, even when every input column is nullable=false.
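
If you want to check the flag programmatically rather than by eyeballing printSchema, here is a minimal sketch; df is a hypothetical variable holding the DataFrame built in the question:

// StructType#apply(String) returns the StructField for that column;
// its nullable() flag is the value printSchema displays.
boolean emailIdNullable = df.schema().apply("email_id").nullable(); // true for the raw UDF column
boolean srcIdNullable   = df.schema().apply("src_id").nullable();   // false, as in the question's schema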
If you are sure you don't have nulls in the column, you can wrap it in coalesce with a non-null fallback; because lit("") can never be null, Spark infers the result as nullable=false. Note that coalesce in the Java/Scala API takes Column arguments, not column names. Depending on your imports, you can use either (with static imports of org.apache.spark.sql.functions.*)

.withColumn("email_id", coalesce(col("email_id"), lit("")))

or the fully qualified form

.withColumn("email_id", functions.coalesce(functions.col("email_id"), functions.lit("")))