can we provide default values to a column of a custom schema in pyspark

Time:01-05

I'm new to PySpark, so please let me know if you have a solution for this.

I created a custom schema in PySpark, like below:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType,StructField, StringType, IntegerType
    
    structureSchema = StructType([
        StructField('col1', StringType(), True),
        StructField('col2', StringType(), True),
        StructField('col3', StringType(), True),
    ])

I have a text file containing multiple JSON objects, one per line, something like this:

{'col1':'abc','col2':'abc1','col3':'qwe'}
{'col1':'abc','col2':'abc1'}
{'col1':'abc','col3':'qwe'}

When I load this file with the custom schema, the missing column entries are filled with null values.

    df = spark.read.schema(structureSchema).json(fpath)

col1      col2     col3
abc       abc1     qwe
abc       abc1     null
abc       null     null

Is there a way to fill them with a default value, "NoValueReceived", instead of null, like below?

col1      col2     col3
abc       abc1     qwe
abc       abc1     NoValueReceived
abc       NoValueReceived  NoValueReceived

CodePudding user response:

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() replaces NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any other constant literal value.

Reference: https://sparkbyexamples.com/pyspark/pyspark-fillna-fill-replace-null-values/

    df = spark.read.schema(structureSchema).json(fpath)
    df = df.na.fill(value="NoValueReceived", subset=["col1", "col2", "col3"])

CodePudding user response:

In this case I think you can check whether the column is present in your incoming data, like:

    from pyspark.sql.functions import coalesce, lit

    # Read in the JSON file with a pre-defined schema
    df = spark.read.json("path/to/file.json", schema=structureSchema)

    # Fill in missing values with a default value
    df = df.withColumn("column_name", coalesce(df["column_name"], lit("default_value")))
