I have a dataframe df, but because 3 of its columns that should normally be "double" contain values like "NULL", their type is automatically inferred as string.
df =
col_0   col_1   col_2    col_3
Word    73452   3859.0   4734.0000000
Word1   23452   3859.0   NULL
Word2   73452   NULL     4758.0000000
Word1   73454   NULL     4756.0000000
Word2   73452   3456.0   4758.0000000
I want to change this, and my attempt was:
from pyspark.sql.types import DoubleType

def my_function(df):
    df_1 = df.withColumn("col_1", df["col_1"].cast(DoubleType()))
    df_2 = df_1.withColumn("col_2", df_1["col_2"].cast(DoubleType()))
    df_3 = df_2.withColumn("col_3", df_2["col_3"].cast(DoubleType()))
    return df_3
But I don't know how to replace the string "NULL" with an actual null value. And even then, would that be enough?
CodePudding user response:
You can try replacing the strings with the value "NULL" with Python's None and then casting to the correct types, like this:
df = spark.createDataFrame(
    [("Word1", 23452, 3859.0, "NULL"), ("Word2", 73452, "NULL", 4758.0000000)],
    "col_0: string, col_1: int, col_2: string, col_3: string",
)
print(df.dtypes)
# [('col_0', 'string'), ('col_1', 'int'), ('col_2', 'string'), ('col_3', 'string')]

from pyspark.sql.functions import col

print(
    df.na.replace("NULL", None)
    .select(
        col("col_0"),
        col("col_1").cast("int"),
        col("col_2").cast("double"),
        col("col_3").cast("double"),
    )
    .dtypes
)
# [('col_0', 'string'), ('col_1', 'int'), ('col_2', 'double'), ('col_3', 'double')]
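
Applied back to your original my_function, a minimal sketch of the same idea (assuming col_1, col_2 and col_3 are exactly the columns to convert, and that "NULL" is the only placeholder string in them) could look like this:

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

def my_function(df):
    # Replace the literal string "NULL" with a real null in every string column,
    # then cast the affected columns to double.
    df_clean = df.na.replace("NULL", None)
    for c in ["col_1", "col_2", "col_3"]:
        df_clean = df_clean.withColumn(c, col(c).cast(DoubleType()))
    return df_clean

Note that in Spark's default (non-ANSI) mode, casting a non-numeric string such as "NULL" to double already yields a null, so the cast alone would likely be enough; the explicit na.replace just makes the intent clearer.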