I have a CSV (with ;
as a separator) with several columns made like this:
c1;c2;c3
1;foo";"bar;10
2;foo;20
3;bar;30
...
The ";"
is used by the person who made the file as a way to tell that ;
is not a separator.
I need to read the file with spark.read.options(delimiter=';',header=True,inferSchema=True).csv('path\to\file)
but I obtain:
c1 | c2 | c3
1 | foo"| bar;10
2 | foo | 20
3 | bar | 30
...
The one I need:
c1 c2 c3
1 foo;bar 10
2 foo 20
3 bar 30
...
Is there a way to change delimiter=';'
to obtain the dataset I need? I have no way to change neither the delimiter or the ";".
CodePudding user response:
You could:
- read your CSV as a text file
- replace all
;
with|
- replace all
"|"
with;
- split on
|
rdd = spark.sparkContext.textFile(r'your\path\test.csv')
rdd = rdd.map(lambda line: line.replace(';', '|').replace('"|"', ';').split('|'))
header = rdd.first()
df = rdd.filter(lambda line: line != header).toDF(header)
df.show()
# --- ------- ---
# | c1| c2| c3|
# --- ------- ---
# | 1|foo;bar| 10|
# | 2| foo| 20|
# | 3| bar| 30|
# --- ------- ---