I am trying to clean my dataset of missing values. The rows contain values like:
ID | A | B |
---|---|---|
1 | 324 | |
2 | Breda | |
3 | null | 34556 |
I would like to see null in A1 and B2, and so on, without cleaning the data column by column; I would like to loop over all the columns without specifying their names.
I have found this code, but the last line returns an error. My table name is custom:
def replaceEmptyCols(columns: Array[String]): Array[Column] = {
  columns.map(c => {
    when(col(c) === "", null).otherwise(col(c)).alias(c)
  })
}
custom.select(replaceEmptyCols(custom.columns):_*).show()
The error is:
SyntaxError: invalid syntax (, line 6)
  File "<command-447346330485202>", line 6
    custom.select(replaceEmptyCols(custom.columns):_*).show()
                                                  ^
SyntaxError: invalid syntax
CodePudding user response:
The snippet you found is Scala, not Python: c => { ... } is a Scala lambda and :_* is Scala's varargs expansion, so pasting it into a Python notebook cell raises a SyntaxError. In PySpark, maybe you are looking for something like this?
custom = spark.createDataFrame(
    [('1', '', '324'),
     ('2', 'Breda', ''),
     ('3', None, '34556')],
    ['ID', 'A', 'B']
)
custom.show()
# +---+-----+-----+
# | ID|    A|    B|
# +---+-----+-----+
# |  1|     |  324|
# |  2|Breda|     |
# |  3| null|34556|
# +---+-----+-----+
import pyspark.sql.functions as F
def replaceEmptyCols(df, columns: list):
    # Replace empty strings and real nulls with the literal string 'null'.
    # Note: F.col(c) == None never matches in Spark SQL; use isNull() instead.
    for c in columns:
        df = df.withColumn(
            c,
            F.when((F.col(c) == '') | F.col(c).isNull(), F.lit('null'))
             .otherwise(F.col(c))
        )
    return df
replaceEmptyCols(custom, [c for c in custom.columns if c not in ['ID']]).show()
# +---+-----+-----+
# | ID|    A|    B|
# +---+-----+-----+
# |  1| null|  324|
# |  2|Breda| null|
# |  3| null|34556|
# +---+-----+-----+
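Note that F.lit('null') fills the cells with the literal string 'null', not a real SQL NULL; if actual nulls are wanted, use F.lit(None) instead. For reference, the select-based pattern from the original Scala snippet also translates directly to Python, with Scala's :_* varargs expansion becoming Python's * argument unpacking. A minimal sketch, assuming the same custom DataFrame as above (replaceEmptyColsSelect is a hypothetical helper name):
import pyspark.sql.functions as F

def replaceEmptyColsSelect(columns):
    # One when/otherwise expression per column, keeping the original column name.
    # None inside when() produces a real SQL NULL (equivalent to F.lit(None)).
    return [F.when(F.col(c) == '', None).otherwise(F.col(c)).alias(c)
            for c in columns]

# Python's * unpacking plays the role of Scala's :_* here.
custom.select(*replaceEmptyColsSelect(custom.columns)).show()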