I have a spark dataframe as below
a1 | a2 | a3 | a4 |
---|---|---|---|
A | 12 | 9 | 1 |
B | 14 | 13 | 1 |
C | 7 | 3 | 0 |
I want to create a new column A5 such with conditioning such as
if a1 = A then a5 = Car
if a2>0 then a5 = Bus
if a3>0 and a4 =1 then a5 = Bike
The desired output should as below
a1 | a2 | a3 | a4 | a5 |
---|---|---|---|---|
A | 12 | 9 | 1 | Car |
A | 12 | 9 | 1 | Bus |
A | 12 | 9 | 1 | Bike |
B | 14 | 13 | 1 | Bus |
B | 14 | 13 | 1 | Bike |
C | 7 | 3 | 0 | Bus |
Please help on how to add this new column. Thank you in advance
CodePudding user response:
You can define your own function
from pyspark.sql import functions as f
from pyspark.sql.types import StringType
def myfunction(a1, a2, a3):
if a1 == "A":
return "Car"
elif a2 > 0:
return "Bus"
elif a3 > 0 and a4 == 1:
return "Bike"
df = df.withColumn("a5", f.udf(myfunction(df.a1, df.a2, df.a3), StringType()))
CodePudding user response:
df = spark.createDataFrame(
[
('A','12','9','1'),
('B','14','13','1'),
('C','7','3','0')
],
['a1','a2','a3','a4']
)
from pyspark.sql.functions import when, lit, col
res = df.withColumn("a5",when(df.a1 == 'A', lit('Car')))\
.unionByName(df.withColumn("a5",when(df.a2 > 0, lit('Bus'))))\
.unionByName(df.withColumn("a5",when((df.a3 > 0)&(df.a4 == 1), lit('Bike'))))\
.filter(col('a5').isNotNull())\
.orderBy(col('a1').asc())
res.show()
# --- --- --- --- ----
# | a1| a2| a3| a4| a5|
# --- --- --- --- ----
# | A| 12| 9| 1| Bus|
# | A| 12| 9| 1| Car|
# | A| 12| 9| 1|Bike|
# | B| 14| 13| 1| Bus|
# | B| 14| 13| 1|Bike|
# | C| 7| 3| 0| Bus|
# --- --- --- --- ----