I am trying to create a DataFrame to write to a BigQuery table. One column in the output table is a REQUIRED ID that I need to generate in my pipeline. I am generating it with a UDF, but no matter what I try, the column is created as nullable.
How I've created the UDF:
UserDefinedFunction genID = functions.udf(
        (UDF1<String, String>) this::generateEmailID, DataTypes.StringType);
The method the UDF calls:
private String generateEmailID(String srcId) {
    return UUID.nameUUIDFromBytes(("1_" + srcId).getBytes()).toString();
}
I then apply this to my temp view transformedData like this:
spark.sql("SELECT message_ID AS src_id FROM transformedData")
.withColumn(email_id, genID.apply(functions.col("src_id")))
This column needs to be REQUIRED to match the output table, and the column "src_id" is already nullable=false. So why does "email_id" get created with nullable=true, and how can I stop that from happening so I can write to the table? The schema I get is:
root
|-- email_id: string (nullable = true)
|-- src_id: string (nullable = false)
CodePudding user response:
That's probably how udf works. I would assume that Spark doesn't know what the UDF can return (it might return null for some input), so to be on the safe side, it marks the column as nullable, even when every input column is nullable=false.
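
If you want to check the flag programmatically rather than by eyeballing printSchema, here is a minimal sketch; df is a hypothetical variable holding the DataFrame built in the question:

// StructType#apply(String) returns the StructField for that column;
// its nullable() flag is the value printSchema displays.
boolean emailIdNullable = df.schema().apply("email_id").nullable(); // true for the raw UDF column
boolean srcIdNullable   = df.schema().apply("src_id").nullable();   // false, as in the question's schema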
If you are sure you don't have nulls in the column, you can wrap it in coalesce with a non-null fallback; because lit("") can never be null, Spark infers the result as nullable=false. Note that coalesce in the Java/Scala API takes Column arguments, not column names. Depending on your imports, you can use either (with static imports of org.apache.spark.sql.functions.*)

.withColumn("email_id", coalesce(col("email_id"), lit("")))

or the fully qualified form

.withColumn("email_id", functions.coalesce(functions.col("email_id"), functions.lit("")))