I have a csv file with 300 columns, of which I only need 3. Hence I defined a schema for just those three. But when I apply the schema to the dataframe, it does show only 3 columns, yet it incorrectly maps the schema onto the first 3 columns of the file. It is not matching the csv column names with my schema StructFields. Please advise.
from pyspark.sql.types import *
dfschema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True)
])

df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(dfschema) \
    .load("/FileStore/*/*")
df.show(5)
CodePudding user response:
This is actually the expected behaviour of Spark's CSV reader: a user-supplied schema is matched to the csv columns by position, not by header name. When the values in those positions cannot be parsed with the supplied types, Spark treats the row as a corrupt record. The easiest way to see this is to add an extra column named _corrupt_record with type string to the schema; the malformed rows will show up in that column, as in the sketch below.
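A minimal sketch of that diagnostic read, reusing the column names and the /FileStore/*/* path from the question (both may differ in your environment):

from pyspark.sql.types import StructType, StructField, IntegerType, DateType, StringType

debug_schema = StructType([
    StructField("Call Number", IntegerType(), True),
    StructField("Incident Number", IntegerType(), True),
    StructField("Entry DtTm", DateType(), True),
    # rows that Spark cannot parse into the fields above land here (PERMISSIVE mode)
    StructField("_corrupt_record", StringType(), True)
])

debug_df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(debug_schema) \
    .load("/FileStore/*/*")

# show all columns; the raw text of any malformed row appears in _corrupt_record
debug_df.show(5, truncate=False)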
The easiest way to get the correct columns is to read the csv file without a schema (or, if feasible, with the complete schema) and then select the required columns, as sketched below. There is no performance penalty for reading the whole csv file: unlike columnar formats such as Parquet, csv cannot be read selectively by column, so the file is always read completely anyway.
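A minimal sketch of that approach, again assuming the column names and path from the question; the timestamp pattern for Entry DtTm is a guess and has to be adjusted to the actual data:

from pyspark.sql.functions import col, to_timestamp

# Read the whole file; with no schema and no inferSchema, every column comes back as string.
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .load("/FileStore/*/*")

# Keep only the three required columns and cast them to the desired types.
df = raw_df.select(
    col("Call Number").cast("int"),
    col("Incident Number").cast("int"),
    # hypothetical format string; replace it with the one used in your file
    to_timestamp(col("Entry DtTm"), "MM/dd/yyyy hh:mm:ss a").alias("Entry DtTm")
)

df.show(5)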