For a sample spark df:
sdf = ({'id_col': [25, 13, 15],
        'x2': ['b', None, '2'],
        'x3': [None, '0', '3']})
which I would like to have as:
sdf = ({'id_col': [25, 13, 15],
        'x2': [1, 0, 1],
        'x3': [0, 1, 1]})
i.e., replace all nulls with 0 and non-nulls with 1.
Following the syntax from Filling not null values as 1 in pyspark dataframe, it went through all the columns of the sdf. For my case, I need to keep the values of id_col as-is and only replace with 1s and 0s in cols x2 and x3.
I initially tried the following:
I put the x2 and x3 cols in a list:
list = ['x2','x3']
then applied this:
col_selection = [when(col(c).isNull(),0).otherwise(1).alias(c) for c in sdf[[list]].columns]
I get the intended 0s and 1s, but the result contains only the x2 and x3 columns, and when I call sdf again the changes are not there. If I have 17 columns to go through in a 25-column sdf, can you guide me on how to apply the for loop to only those 17 columns?
Thanks
CodePudding user response:
If it is feasible, you can do the columns one by one:
from pyspark.sql import functions as F
import pandas as pd

# build the sample data in pandas, then convert it to a Spark DataFrame
sdf = pd.DataFrame({'id_col': [25, 13, 15],
                    'x2': ['b', None, '2'],
                    'x3': [None, '0', '3']})
sdf = spark.createDataFrame(sdf)

# replace nulls with 0 and non-nulls with 1 in x2 and x3, leaving id_col as-is
sdf.withColumns({"x2": F.when(F.col("x2").isNull(), 0).otherwise(1),
                 "x3": F.when(F.col("x3").isNull(), 0).otherwise(1)}).show()
#output
+------+---+---+
|id_col| x2| x3|
+------+---+---+
|    25|  1|  0|
|    13|  0|  1|
|    15|  1|  1|
+------+---+---+
Note: withColumns is only available in Spark version >= 3.3.0.
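With many target columns (such as the 17 of 25 in the question), the same withColumns call can take a mapping built with a dict comprehension instead of one hand-written entry per column. A minimal sketch, where cols_to_flag is a stand-in name for your list of column names:
cols_to_flag = ['x2', 'x3']  # stand-in: extend to all 17 column names
sdf.withColumns(
    {c: F.when(F.col(c).isNull(), 0).otherwise(1) for c in cols_to_flag}
).show()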
To do it in a loop:
lis = ['x2', 'x3']
for a in lis:
    # overwrite each listed column with 0 for null, 1 otherwise
    sdf = sdf.withColumn(a, F.when(F.col(a).isNull(), 0).otherwise(1))
sdf.show()
#output
+------+---+---+
|id_col| x2| x3|
+------+---+---+
|    25|  1|  0|
|    13|  0|  1|
|    15|  1|  1|
+------+---+---+
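The loop already generalizes: put all 17 column names in lis and the other 8 columns pass through untouched. A single select also works and mirrors the list comprehension from the question, keeping id_col (and every other unlisted column) as-is. A minimal sketch, assuming sdf still holds the original values:
lis = ['x2', 'x3']  # your 17 of 25 column names
sdf = sdf.select([
    F.when(F.col(c).isNull(), 0).otherwise(1).alias(c) if c in lis else F.col(c)
    for c in sdf.columns
])
sdf.show()
Because select returns a new DataFrame, assigning the result back to sdf is what makes the change stick, which is why calling sdf after the original attempt showed no effect.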