Home > Mobile >  How to create a new column in pyspark
How to create a new column in pyspark

Time:09-15

I have a spark dataframe as below

a1 a2 a3 a4
A 12 9 1
B 14 13 1
C 7 3 0

I want to create a new column A5 such with conditioning such as

       if a1 = A then a5 = Car
       if a2>0   then a5 = Bus
       if a3>0 and a4 =1 then a5 = Bike

The desired output should as below

a1 a2 a3 a4 a5
A 12 9 1 Car
A 12 9 1 Bus
A 12 9 1 Bike
B 14 13 1 Bus
B 14 13 1 Bike
C 7 3 0 Bus

Please help on how to add this new column. Thank you in advance

CodePudding user response:

You can define your own function

from pyspark.sql import functions as f
from pyspark.sql.types import StringType


def myfunction(a1, a2, a3):
   if a1 == "A":
       return "Car"
   elif a2 > 0:
       return "Bus"
   elif a3 > 0 and a4 == 1:
       return "Bike"

df = df.withColumn("a5", f.udf(myfunction(df.a1, df.a2, df.a3), StringType()))

CodePudding user response:

df = spark.createDataFrame(
    [
    ('A','12','9','1'),
    ('B','14','13','1'),
    ('C','7','3','0')
    ],
    ['a1','a2','a3','a4']
)

from pyspark.sql.functions import when, lit, col

res = df.withColumn("a5",when(df.a1 == 'A', lit('Car')))\
        .unionByName(df.withColumn("a5",when(df.a2 > 0, lit('Bus'))))\
        .unionByName(df.withColumn("a5",when((df.a3 > 0)&(df.a4 == 1), lit('Bike'))))\
        .filter(col('a5').isNotNull())\
        .orderBy(col('a1').asc())


res.show()

#  --- --- --- --- ---- 
# | a1| a2| a3| a4|  a5|
#  --- --- --- --- ---- 
# |  A| 12|  9|  1| Bus|
# |  A| 12|  9|  1| Car|
# |  A| 12|  9|  1|Bike|
# |  B| 14| 13|  1| Bus|
# |  B| 14| 13|  1|Bike|
# |  C|  7|  3|  0| Bus|
#  --- --- --- --- ---- 
  • Related