I am reading data from a text file with PySpark using the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.option("sep", "|").option("header", "false").csv('D:\\DATA-2021-12-03.txt')
My text file looks like this:
col1|cpl2|col3|col4
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
But the output I got was:
col1|cpl2|col3|col4
112 |4344|fn1 | home_a
Is there a way to add those missing columns to the dataframe? I am expecting:
col1|cpl2|col3|col4|col5|col6|col7|col8
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>
CodePudding user response:
You can explicitly specify the schema instead of inferring it. Spark takes the column count from the first row of the file, so the extra fields in longer rows get dropped; an explicit 8-column schema keeps them.
from pyspark.sql.types import StructType, StringType

# Declare all 8 columns up front so Spark does not truncate the longer rows.
schema = StructType() \
    .add("col1", StringType(), True) \
    .add("col2", StringType(), True) \
    .add("col3", StringType(), True) \
    .add("col4", StringType(), True) \
    .add("col5", StringType(), True) \
    .add("col6", StringType(), True) \
    .add("col7", StringType(), True) \
    .add("col8", StringType(), True)
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('70475571_data.txt')
Output
+----+----+----+-------+-------+---------+-------+--------+
|col1|col2|col3|   col4|   col5|     col6|   col7|    col8|
+----+----+----+-------+-------+---------+-------+--------+
|112 |4344|fn1 | home_a| extras| applied | <null>| <empty>|
+----+----+----+-------+-------+---------+-------+--------+
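If you don't know the number of columns up front, you can scan the file first with plain Python to find the widest row and generate the column names, then feed those names into a StructType. This is a hedged sketch, not part of the original answer; `infer_column_names` is a hypothetical helper:

```python
import csv

def infer_column_names(path, sep="|", prefix="col"):
    """Scan a delimited text file and return generated column names
    ("col1", "col2", ...) wide enough for the longest row."""
    max_cols = 0
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=sep):
            max_cols = max(max_cols, len(row))
    return [f"{prefix}{i}" for i in range(1, max_cols + 1)]
```

Each returned name can then be added to the schema in a loop, e.g. `for name in infer_column_names(path): schema = schema.add(name, StringType(), True)`, before passing it to `spark.read.schema(schema)`.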