I have a dataframe df, but because 3 of its columns that should normally be "double" contain values like "NULL", their type is automatically inferred as string.
df =
col_0   col_1   col_2    col_3
Word    73452   3859.0   4734.0000000
Word1   23452   3859.0   NULL
Word2   73452   NULL     4758.0000000
Word1   73454   NULL     4756.0000000
Word2   73452   3456.0   4758.0000000
I want to change this, and my attempt was:
from pyspark.sql.types import DoubleType

def my_function(df):
    df_1 = df.withColumn("col_1", df["col_1"].cast(DoubleType()))
    df_2 = df_1.withColumn("col_2", df_1["col_2"].cast(DoubleType()))
    df_3 = df_2.withColumn("col_3", df_2["col_3"].cast(DoubleType()))
    return df_3
But I don't know how to replace the string "NULL" with an actual null value. And even then, would that be enough?
CodePudding user response:
You can try replacing the strings with the value "NULL" with Python's None and then casting to the correct types, like this:
df = spark.createDataFrame(
    [("Word1", 23452, 3859.0, "NULL"), ("Word2", 73452, "NULL", 4758.0000000)],
    "col_0: string, col_1: int, col_2: string, col_3: string",
)
print(df.dtypes)
# [('col_0', 'string'), ('col_1', 'int'), ('col_2', 'string'), ('col_3', 'string')]

from pyspark.sql.functions import col

print(
    df.na.replace("NULL", None)
    .select(
        col("col_0"),
        col("col_1").cast("int"),
        col("col_2").cast("double"),
        col("col_3").cast("double"),
    )
    .dtypes
)
# [('col_0', 'string'), ('col_1', 'int'), ('col_2', 'double'), ('col_3', 'double')]
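
Applied back to your original my_function, a minimal sketch of the same idea (assuming col_1, col_2 and col_3 are exactly the columns to convert, and that "NULL" is the only placeholder string in them) could look like this:

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

def my_function(df):
    # Replace the literal string "NULL" with a real null in every string column,
    # then cast the affected columns to double.
    df_clean = df.na.replace("NULL", None)
    for c in ["col_1", "col_2", "col_3"]:
        df_clean = df_clean.withColumn(c, col(c).cast(DoubleType()))
    return df_clean

Note that in Spark's default (non-ANSI) mode, casting a non-numeric string such as "NULL" to double already yields a null, so the cast alone would likely be enough; the explicit na.replace just makes the intent clearer.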