I'm new to PySpark, so please let me know if you have a solution for this.
I created a custom schema in PySpark, like below:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

structureSchema = StructType([
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])
I have a text file containing multiple JSON objects, one per line, something like this:
{'col1':'abc','col2':'abc1','col3':'qwe'}
{'col1':'abc','col2':'abc1'}
{'col1':'abc','col3':'qwe'}
When I load this file with the custom schema, the missing column entries are filled with null:
df = spark.read.schema(structureSchema).json(fpath)
col1 col2 col3
abc  abc1 qwe
abc  abc1 null
abc  null qwe
Is there a way I can fill them with a default value, "NoValueReceived", instead of null, like below?
col1 col2            col3
abc  abc1            qwe
abc  abc1            NoValueReceived
abc  NoValueReceived qwe
CodePudding user response:
In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any other constant literal value.
Reference: https://sparkbyexamples.com/pyspark/pyspark-fillna-fill-replace-null-values/
df = spark.read.schema(structureSchema).json(fpath)
df = df.na.fill(value="NoValueReceived", subset=["col1", "col2", "col3"])
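For reference, here is a minimal, self-contained sketch of the same idea (the data is hard-coded to mirror the question, and fillna() is interchangeable with na.fill()):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])
data = [('abc', 'abc1', 'qwe'), ('abc', 'abc1', None), ('abc', None, 'qwe')]
df = spark.createDataFrame(data, schema)

# Only string columns are affected here, because the fill value is a string
df.fillna('NoValueReceived').show()

Note that na.fill() only replaces nulls in columns whose type matches the fill value, so a string default like "NoValueReceived" leaves numeric columns untouched.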
CodePudding user response:
In this case, you can check whether a column arrived in your incoming data and fall back to a default with coalesce, like this:
from pyspark.sql.functions import coalesce, lit

# Read in the JSON file with a pre-defined schema
df = spark.read.json("path/to/file.json", schema=schema)

# Fill in missing values with a default value
df = df.withColumn("column_name", coalesce(df["column_name"], lit("default_value")))
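If you want to apply this to every column without naming each one, a sketch along these lines should work (untested; it reuses structureSchema and fpath from the question):

from pyspark.sql.functions import coalesce, lit

df = spark.read.schema(structureSchema).json(fpath)
# Missing keys are read as null, so coalesce falls through to the default
df = df.select([
    coalesce(df[f.name], lit('NoValueReceived')).alias(f.name)
    for f in structureSchema.fields
])

For string columns this is functionally the same as na.fill(), but coalesce gives you per-column control if different columns need different defaults.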