I have a sample dataframe:
df = spark.createDataFrame([('name1','id1',1,None,3),('name2','id2',None,2,5)],['NAME','personID','col1','col2',col3'])
My use case has 15 columns
What I would like to do is using case when and loop, to add new columns that correspond to each column from the original except the first two columns. Within those new columns, it will give a value of 1 if notNull, otherwise 0.
I am aiming to get something like below:
-------- -------- -------- ------- ------- ------- ------ ------
|Name | ID | col1 | col2 | col3 | col1_N|col2_N|col3_N|
-------- -------- -------- ------- ------- ------- ------ ------
|name1 | id1 | 1 | Null | 3 | 1 | 0 | 1 |
|name2 | id2 | Null | 2 | 5 | 0 | 1 | 1 |
-------- -------- -------- ------- ------- ------- ------ ------
the first five columns are the original columns, the last three columns will be added with corresponding 1 or 0 from 'col1', 'col2', and 'col3' values.
The last code/s I am working on creates a new one but does not keep the original dataframe values.
df.select([when(col(c).isNotNull(), 1).otherwise(0).alias(c '_N') for c in df.columns])
for which I get:
------- ------- ------- ------ ------
| Name_N| ID_N | col1_N|col2_N|col3_N|
------- ------- ------- ------ ------
| 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 |
------- ------- ------- ------ ------
The above could have been acceptable but I need to keep the original values of Name and ID columns.
I got an INvalidArgument with this df.select(['*'],[when(col(c).isNotNull(), 1).otherwise(0).alias(c '_N') for c in df.columns])
TypeError: Invalid argument, not a string or column: ['*'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I thought selecting all first will give me all the columns of the original
UPDATE: somehow this worked, but I only get the last column:
for c in df.columns[2:]:
sdf = df.withColumn(c '_N', when(col(c).isNotNull(),1).otherwise(0))
but this is what I get:
-------- -------- -------- ------- ------- ------
|Name | ID | col1 | col2 | col3 |col3_N|
-------- -------- -------- ------- ------- ------
|name1 | id1 | 1 | Null | 3 | 1 |
|name2 | id2 | Null | 2 | 5 | 1 |
-------- -------- -------- ------- ------- ------
``
I only got the last original column
CodePudding user response:
Using list comprehension as show below will give expected result.
df.select([col(c) if c in ['NAME', 'personID'] else when(col(c).isNotNull(), 1).otherwise(0).alias(f"{c}_N") for c in df.columns]).show()
----- -------- ------ ------ ------
| NAME|personID|col1_N|col2_N|col3_N|
----- -------- ------ ------ ------
|name1| id1| 1| 0| 1|
|name2| id2| 0| 1| 1|
----- -------- ------ ------ ------
CodePudding user response:
Just fix your 1st approach with specifying a slice of columns and simplify boolean condition:
df.select([col(c).isNotNull().cast("integer").alias(c '_N') for c in df.columns[2:]])