remove last character from pyspark df columns

I'm reading a CSV file with PySpark like this: df = spark.read.format('csv').options(header=True, encoding='windows-1251', delimiter=';').load('csv_file.csv')

As a result, the column values come back with a stray single quote " ' " at the end, like 12345'

There is not a single line in the file with a quote at the end; I don't know where Spark finds it.

I need to remove this quote.

By the way, pandas reads the CSV without a quote at the end of every row, but I can't convert the pandas DataFrame to a Spark DataFrame: I get the error cannot merge type DoubleType and StringType.

The DataFrame has some completely empty columns.

I tried:

from pyspark.sql.functions import *

for i in df.columns:
    df.withColumn(i, expr("substring({name}, 1, length({name}) - 1)".format(name=i)))

for i in df.columns:
    df.withColumn(i, col(i).substr(lit(0), length(col(i)) - 1))

None of this helped me.

Thanks.

The DataFrame as read:

col1   | col2
12345' | abcde'

Expected output:

col1  | col2
12345 | abcde

CodePudding user response:

Use a list comprehension:

from pyspark.sql import functions as F

df.select(*[F.regexp_replace(F.col(c), "'", '').alias(c) for c in df.columns]).show()

+-----+-----+
| col1| col2|
+-----+-----+
|12345|abcde|
+-----+-----+