Spark dataframe from CSV file with separator surrounded with quotes-CodePudding

I have a CSV (with ; as a separator) with several columns made like this:

c1;c2;c3
1;foo";"bar;10
2;foo;20
3;bar;30
...

The ";" is used by the person who made the file as a way to tell that ; is not a separator.

I need to read the file with spark.read.options(delimiter=';',header=True,inferSchema=True).csv('path\to\file) but I obtain:

c1 | c2  | c3
1  | foo"| bar;10
2  | foo | 20
3  | bar | 30
...

The one I need:

c1 c2     c3
1 foo;bar 10
2 foo     20
3 bar     30
...

Is there a way to change delimiter=';' to obtain the dataset I need? I have no way to change neither the delimiter or the ";".

CodePudding user response：

You could:

read your CSV as a text file
replace all ; with |
replace all "|" with ;
split on |

rdd = spark.sparkContext.textFile(r'your\path\test.csv')
rdd = rdd.map(lambda line: line.replace(';', '|').replace('"|"', ';').split('|'))
header = rdd.first()
df = rdd.filter(lambda line: line != header).toDF(header)
df.show()
#  --- ------- --- 
# | c1|     c2| c3|
#  --- ------- --- 
# |  1|foo;bar| 10|
# |  2|    foo| 20|
# |  3|    bar| 30|
#  --- ------- ---