I am trying to remove the double quotes (`""`) and semicolons (`;`) from my CSV file in PySpark. The data in the CSV looks like this:
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
The code I am using is:
from pyspark.sql import functions as F

df = spark.read.options(delimiter=';').csv("C:/Project_bankdata.csv", header=True)
df1 = df.select([F.regexp_replace(c, '"', '').alias(c) for c in df.columns])
df1.show(10, truncate=0)
Output:
+-------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
|"age;""job"" |""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""|""y"""|
+-------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
|58;management|married    |tertiary     |no         |2143       |yes        |no      |unknown    |5      |may      |261         |1           |-1       |0           |unknown     |no    |
+-------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+------+
I am able to get rid of the quotes in the data, but not in the header. How can I remove the double quotes from the header as well?
CodePudding user response:
This is the dataframe output I get when running your code with the CSV data you posted. I am using Spark version 2.3.2:
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|job       |marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|y  |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|58 |management|married|tertiary |no     |2143   |yes    |no  |unknown|5  |may  |261     |1       |-1   |0       |unknown |no |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
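If the header still comes out with stray quotes after reading, one option is to clean the column names and reapply them with `df.toDF(*cleaned)`. The renaming logic itself is plain string manipulation, so here is a Spark-free sketch of it (`clean_columns` is a hypothetical helper name):

```python
def clean_columns(columns):
    """Remove every double-quote character from each column name."""
    return [c.replace('"', '') for c in columns]

# Column names as they appear in the broken header above.
header = ['""marital""', '""education""', '""y"""']
print(clean_columns(header))  # ['marital', 'education', 'y']
```

In Spark this would be applied as `df1 = df.toDF(*clean_columns(df.columns))`, which renames the columns without touching the data.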
CodePudding user response:
I was only able to reproduce your output if I used this input CSV:
"age;""job"";""marital"";""education"";""default"";""balance"";""housing"";""loan"";""contact"";""day"";""month"";""duration"";""campaign"";""pdays"";""previous"";""poutcome"";""y"""
"58;"management"";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
You can read the CSV as a plain text file, remove all the double quotes (`"`) from every line, and then build a dataframe:
rdd = spark.sparkContext.textFile(r"C:\temp\temp.csv")
rdd = rdd.map(lambda line: line.replace('"', '').split(';'))
header = rdd.first()
df = rdd.filter(lambda line: line != header).toDF(header)
df.show()
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# |age|       job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
# | 58|management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|
# +---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
Note: this effectively strips all string quoting from the CSV, so it will only work correctly if none of your quoted values contain `;` inside them.
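The per-line cleanup inside `rdd.map` boils down to a plain string operation, so it can be illustrated without a Spark session. A sketch on shortened versions of the sample lines:

```python
def parse_line(line):
    """Drop all double quotes, then split on the ';' delimiter."""
    return line.replace('"', '').split(';')

header = parse_line('"age;""job"";""y"""')
row = parse_line('58;"management";"no"')
print(header)  # ['age', 'job', 'y']
print(row)     # ['58', 'management', 'no']
```

This also makes the caveat above concrete: a quoted value like `"a;b"` would be split into two fields, because the quotes are removed before splitting.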