Spark CSV file size 2x bigger than with pandas


When I save data to a single CSV file with PySpark, the resulting file is about 2x bigger than when I first convert the DataFrame with .toPandas() and then save it with to_csv().

Any thoughts on what could cause such a big discrepancy?

CodePudding user response:

There are a couple of things that can have an effect on the size difference.

  1. emptyValue: By default, Spark's df.write.csv writes null/empty values wrapped in double quotes, which adds two characters for every such value. To disable the quoting, pass emptyValue='', i.e. .csv(path, emptyValue=''); see the combined sketch after these examples.
# Spark write.csv

some value,"",""

# Pandas to_csv

some value,,
  2. Pandas' implicit datatype casting: when a Spark column holds nullable integers, .toPandas() casts them to float, because pandas integers (backed by NumPy) cannot hold nulls. Each value then gains a trailing .0 in the pandas output, so Spark's write is actually a few characters shorter here.
# Spark write.csv

some value,1000,
some value,,

# Pandas to_csv

some value,1000.0,
some value,,
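
Putting the two points together, here is a minimal, self-contained sketch; the session setup, column names, and output paths are made up for illustration and are not from the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: a nullable string column ("note") and a nullable integer column ("amount")
df = spark.createDataFrame(
    [("some value", "x", 1000), ("some value", None, None)],
    ["label", "note", "amount"],
)

# 1. emptyValue='' makes Spark write empty values as nothing instead of ""
df.coalesce(1).write.csv("out_spark_csv", emptyValue="")

# 2. toPandas() casts the nullable integer column to float64,
#    so to_csv writes 1000.0 where Spark writes 1000
pdf = df.toPandas()
print(pdf.dtypes)   # "amount" shows up as float64
pdf.to_csv("out_pandas.csv")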

Other minor differences: Spark won't write a header row by default while pandas does, and pandas writes the row index by default whereas a Spark DataFrame has no index (see the short sketch below).
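
For completeness, a tiny sketch of those defaults; the DataFrame and file names here are made up for illustration, and the Spark line is left as a comment since it needs a running session:

import pandas as pd

pdf = pd.DataFrame({"label": ["some value"], "amount": [1000]})
pdf.to_csv("pandas_default.csv")               # writes header and the row index by default
pdf.to_csv("pandas_noindex.csv", index=False)  # index dropped to match Spark

# Spark side (sketch): the header is only written when asked for
# df.write.csv("out_spark_csv", header=True)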

I think there are more factors in how the two libraries store their data. To narrow down the difference, I would save a small fraction of the data with both methods and look at the files in plain text.
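
For example, a rough sketch of that check; the input path, sample size, and output locations are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("input.parquet")   # placeholder for the real data

sample = df.limit(1000)                    # small slice so the files are easy to eyeball
sample.coalesce(1).write.csv("sample_spark_csv", header=True)
sample.toPandas().to_csv("sample_pandas.csv", index=False)

# Then compare the raw lines, e.g.:
#   head sample_spark_csv/part-*.csv sample_pandas.csv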
