I am reading data from a text file with PySpark using the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.option("sep", "|").option("header", "false").csv('D:\\DATA-2021-12-03.txt')
My text file looks like this:
col1|cpl2|col3|col4
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
But the output I got was:
col1|cpl2|col3|col4
112 |4344|fn1 | home_a
Is there a way to add those missing columns to the dataframe? I am expecting:
col1|cpl2|col3|col4|col5|col6|col7|col8
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
CodePudding user response:
You can explicitly specify the schema instead of inferring it. Spark takes the column count from the first row of the file, so the extra fields in longer rows get dropped; an explicit 8-column schema keeps them.
from pyspark.sql.types import StructType, StringType

# Declare all 8 columns up front so Spark does not truncate the longer rows.
schema = StructType() \
    .add("col1", StringType(), True) \
    .add("col2", StringType(), True) \
    .add("col3", StringType(), True) \
    .add("col4", StringType(), True) \
    .add("col5", StringType(), True) \
    .add("col6", StringType(), True) \
    .add("col7", StringType(), True) \
    .add("col8", StringType(), True)
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('70475571_data.txt')
Output
+----+----+----+-------+-------+---------+-------+--------+
|col1|col2|col3|   col4|   col5|     col6|   col7|    col8|
+----+----+----+-------+-------+---------+-------+--------+
|112 |4344|fn1 | home_a| extras| applied | <null>| <empty>|
+----+----+----+-------+-------+---------+-------+--------+
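If you don't know the number of columns up front, you can scan the file first with plain Python to find the widest row and generate the column names, then feed those names into a StructType. This is a hedged sketch, not part of the original answer; `infer_column_names` is a hypothetical helper:

```python
import csv

def infer_column_names(path, sep="|", prefix="col"):
    """Scan a delimited text file and return generated column names
    ("col1", "col2", ...) wide enough for the longest row."""
    max_cols = 0
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=sep):
            max_cols = max(max_cols, len(row))
    return [f"{prefix}{i}" for i in range(1, max_cols + 1)]
```

Each returned name can then be added to the schema in a loop, e.g. `for name in infer_column_names(path): schema = schema.add(name, StringType(), True)`, before passing it to `spark.read.schema(schema)`.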