I wrote a DataFrame containing 1,000,000 rows to Parquet.
When I read the Parquet file back and count the rows, the result is 1,000,000.
df = spark.read.parquet(parquet_path)
df.count()
>>> 1000000
When I write the same DataFrame to CSV, read it back, then count the rows, the output is 1,000,365.
df.write.csv(csv_path, sep='\t', header=False, mode='overwrite')
df_csv = spark.read.csv(csv_path, sep='\t', header=False)
df_csv.count()
>>> 1000365
Why is Spark adding the extra rows to the dataset?
I tried a variety of separators, and both False and True for the header option.
I also tried to coalesce and repartition but the same number keeps appearing.
Does anyone know why Spark would add the extra rows?
CodePudding user response:
This occurs when reading CSV files whose fields contain newlines. By default, the reader treats every newline as a record boundary, so a field with an embedded newline gets split into multiple records.
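As a minimal sketch of the problem (the path and column names here are made up for illustration): Spark quotes the field with the embedded newline on write, but the default reader still splits on it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows; the second has an embedded newline in the 'note' field.
df = spark.createDataFrame(
    [(1, 'plain text'), (2, 'text with an\nembedded newline')],
    ['id', 'note'],
)
df.count()
>>> 2

# hypothetical path; the writer quotes the field containing the newline,
# but without multiLine the reader treats that newline as a record break
df.write.csv('/tmp/demo_csv', mode='overwrite')
spark.read.csv('/tmp/demo_csv').count()
>>> 3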
To read a multiline CSV correctly, first make sure any field containing newlines is properly quoted, for example:
1,short text,"long text with newline\n quote within this field should be \"escaped\"."
Double quote is the default, but the quote character can be set to something else via the quote option.
Check the default csv options here: https://spark.apache.org/docs/latest/sql-data-sources-csv.html
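If you want to be explicit about these options on the write side, a sketch along these lines works (csv_path as in the question; the quote and escape values shown are just Spark's defaults):

df.write.csv(
    csv_path,
    sep='\t',
    quote='"',    # default quote character
    escape='\\',  # default escape for quotes inside quoted fields
    header=False,
    mode='overwrite',
)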
Then, when you read a CSV that contains embedded newlines (\n), you need to add the multiLine=True option:
spark.read.csv(csv_path, multiLine=True)
# or
spark.read.option('multiLine', True).csv(csv_path)
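Applied to the round trip in the question, the counts should then line up (same df and csv_path as above):

df.write.csv(csv_path, sep='\t', header=False, mode='overwrite')
df_csv = spark.read.csv(csv_path, sep='\t', header=False, multiLine=True)
df_csv.count()  # should now match the Parquet count of 1,000,000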