After processing a large dataset in PySpark, I saved it to CSV using the following command:
df.repartition(1).write.option("header", "true").option("delimeter", "\t").csv("csv_data", mode="overwrite")
Now I want to use pd.read_csv() to load it again.
info = pd.read_csv('part0000.csv', sep='\t', header='infer')
info is returned as one column where the data is separated by commas, not '\t':
col1name,col2name,col3name
val1,val2,val3
I tried specifying sep=',', but I got a parsing error because some rows have more than 3 columns.
How can I fix that without skipping any rows? Is there anything I can do on the Spark side to resolve it, such as specifying '|' as the delimiter?
CodePudding user response:
The csv format writer method DOESN'T have a delimeter option; what you need is the sep option. Please refer to the documentation here.
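For reference, a minimal sketch of the corrected write plus the pandas read-back, assuming the same df and output directory as in the question (the glob lookup of the part file is my addition, since Spark generates the file name):

import glob
import pandas as pd

# "sep" is the option the csv writer recognizes; the misspelled
# "delimeter" option is silently ignored, so Spark falls back to commas.
df.repartition(1).write.option("header", "true").option("sep", "\t").csv("csv_data", mode="overwrite")

# Spark generates the part-file name, so look it up before reading
# it back with the matching separator.
part_file = glob.glob("csv_data/part-*.csv")[0]
info = pd.read_csv(part_file, sep="\t")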
CodePudding user response:
As mentioned here, pandas treats the character " as a quote character and expects a closing " for every opening ", which is not always true in my case. To fix this problem, we have to specify quoting=3 so that quoting is disabled.
data = pd.read_csv('data.csv', header='infer', sep='\t', engine='python', quoting=3)
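As a small readability variation (not from the original answer), the same value can be passed as the named constant from the csv module:

import csv
import pandas as pd

# quoting=3 is csv.QUOTE_NONE: the parser ignores '"' entirely,
# so stray unmatched quotes no longer break the tab-separated rows.
data = pd.read_csv('data.csv', header='infer', sep='\t', engine='python', quoting=csv.QUOTE_NONE)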