PySpark Dataframe: Column based on existence and Value of another column


What I am trying to do is set an "exists" flag based on an .isNotNull() check on a column that may not exist.

What I mean is: I have a dataframe A like

A     B     C
-------------------
1     a     "Test"  
2     b     null
3     c     "Test2"

Where C isn't necessarily defined. I want to derive another dataframe B:

D     E       F
---------------
1     a      'J'
2     b      'N'
3     c      'J'

Where the column B.F is 'N' everywhere in case A.C is not defined at all, 'N' where A.C's value is null, and 'J' where the value is not null.

How would you proceed at this point?

I thought of using a when statement:

DF.withColumn('F', when(A.C.isNotNull(), 'J').otherwise('N'))

but how would you check for the existence of the column in the same statement?

CodePudding user response:

First you check if the column exists. If not, you create it.

from pyspark.sql import functions as F

if "c" not in df.columns:
    df = df.withColumn("c", F.lit(None))

Then you create the column F:

df = df.withColumn('F', F.when(F.col("C").isNotNull(), 'J').otherwise('N'))
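
Putting both steps together on the question's data, here is a minimal end-to-end sketch (my own, assuming a SparkSession named spark; the rows are the ones from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# example data from the question; build it without "C" to simulate the missing-column case
df = spark.createDataFrame(
    [(1, "a", "Test"), (2, "b", None), (3, "c", "Test2")],
    ["A", "B", "C"],
)

if "C" not in df.columns:
    df = df.withColumn("C", F.lit(None))

df = df.withColumn("F", F.when(F.col("C").isNotNull(), "J").otherwise("N"))
df.show()
# +---+---+-----+---+
# |  A|  B|    C|  F|
# +---+---+-----+---+
# |  1|  a| Test|  J|
# |  2|  b| null|  N|
# |  3|  c|Test2|  J|
# +---+---+-----+---+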

CodePudding user response:

You can check for the column's presence using 'c' in data_sdf.columns. Here's an example using it.

Let's say the input dataframe has 3 columns - ['a', 'b', 'c']

from pyspark.sql import functions as func

# the Python ternary picks the Spark condition before the query is built:
# a real null check if 'c' exists, otherwise a constant True (so 'N' everywhere)
data_sdf. \
    withColumn('d',
               func.when(func.col('c').isNull() if 'c' in data_sdf.columns else func.lit(True), func.lit('N')).
               otherwise(func.lit('J'))
               ). \
    show()

# +---+---+----+---+
# |  a|  b|   c|  d|
# +---+---+----+---+
# |  1|  2|   3|  J|
# |  1|  2|null|  N|
# +---+---+----+---+

Now, let's say there are only 2 columns - ['a', 'b']
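
Since 'c' is now missing, the Python-level 'c' in data_sdf.columns test is False, the condition falls back to func.lit(True), and every row gets 'N'. A sketch of that run (assuming the same spark session and func import as above):

data_sdf = spark.createDataFrame([(1, 2), (1, 2)], ['a', 'b'])

data_sdf. \
    withColumn('d',
               func.when(func.col('c').isNull() if 'c' in data_sdf.columns else func.lit(True), func.lit('N')).
               otherwise(func.lit('J'))
               ). \
    show()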

# +---+---+---+
# |  a|  b|  d|
# +---+---+---+
# |  1|  2|  N|
# |  1|  2|  N|
# +---+---+---+