After processing a large dataset in PySpark, I saved it to CSV using the following command:
df.repartition(1).write.option("header", "true").option("delimeter", "\t").csv("csv_data", mode="overwrite")
Now I want to use pd.read_csv() to load it again.
info = pd.read_csv('part0000.csv', sep='\t', header='infer')
info is returned as one column where the data is separated by commas, not '\t':
col1name,col2name,col3name
val1,val2,val3
I tried specifying sep=',', but I got a parsing error because some rows have more than 3 columns.
How can I fix that without skipping any rows? Is there anything I can do on the Spark side to resolve it, such as specifying '|' as the delimiter?
CodePudding user response:
The csv format writer method DOESN'T have a delimeter option; what you need is the sep option. Please refer to the documentation here.
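For reference, a minimal sketch of the corrected write plus the pandas read-back, assuming the same df and output directory as in the question (the glob lookup of the part file is my addition, since Spark generates the file name):

import glob
import pandas as pd

# "sep" is the option the csv writer recognizes; the misspelled
# "delimeter" option is silently ignored, so Spark falls back to commas.
df.repartition(1).write.option("header", "true").option("sep", "\t").csv("csv_data", mode="overwrite")

# Spark generates the part-file name, so look it up before reading
# it back with the matching separator.
part_file = glob.glob("csv_data/part-*.csv")[0]
info = pd.read_csv(part_file, sep="\t")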
CodePudding user response:
As mentioned here, pandas treats the character " as a quote character and expects a closing " for every opening ", which is not always true in my case. To fix this problem, we have to specify quoting=3 so that quoting is disabled.
data = pd.read_csv('data.csv', header='infer', sep='\t', engine='python', quoting=3)
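As a small readability variation (not from the original answer), the same value can be passed as the named constant from the csv module:

import csv
import pandas as pd

# quoting=3 is csv.QUOTE_NONE: the parser ignores '"' entirely,
# so stray unmatched quotes no longer break the tab-separated rows.
data = pd.read_csv('data.csv', header='infer', sep='\t', engine='python', quoting=csv.QUOTE_NONE)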