Logical with count in PySpark


I'm new to PySpark and I have a problem to solve.

I have a DataFrame with 4 columns: customer, PersonId, is_online_store and count:

customer  PersonId  is_online_store  count
afabd2d2  4         true             1
afabd2d2  8         true             2
afabd2d2  3         true             1
afabd2d2  2         false            1
afabd2d2  4         false            1

I need to create a new column according to the following rules:

  • If a PersonId has at least 1 true and no false in is_online_store, then Online
  • If a PersonId has at least 1 false and no true, then Offline
  • If a PersonId has at least 1 false AND at least 1 true, then Hybrid

How it should look:

customer  PersonId  New_Column
afabd2d2  4         Hybrid
afabd2d2  8         Online
afabd2d2  3         Online
afabd2d2  2         Offline
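
For reference, here is a minimal sketch that rebuilds the sample DataFrame above (the SparkSession setup and column types are my assumptions, inferred from the table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reconstruction of the sample data shown above
df = spark.createDataFrame(
    [
        ("afabd2d2", 4, True, 1),
        ("afabd2d2", 8, True, 2),
        ("afabd2d2", 3, True, 1),
        ("afabd2d2", 2, False, 1),
        ("afabd2d2", 4, False, 1),
    ],
    ["customer", "PersonId", "is_online_store", "count"],
)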

CodePudding user response:

Group by customer and PersonId, then use a when expression to check whether all values in the is_online_store column are true, all are false, or a mix of the two, using for example the bool_and function:

from pyspark.sql import functions as F

# bool_and(is_online_store) is true only when every row in the group is true
df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.expr("bool_and(is_online_store)"), "Online")     # all true  -> Online
     .when(F.expr("bool_and(!is_online_store)"), "Offline")   # all false -> Offline
     .otherwise("Hybrid")                                      # mixed     -> Hybrid
     .alias("New_Column")
)

df1.show()
#+--------+--------+----------+
#|customer|PersonId|New_Column|
#+--------+--------+----------+
#|afabd2d2|       2|   Offline|
#|afabd2d2|       8|    Online|
#|afabd2d2|       4|    Hybrid|
#|afabd2d2|       3|    Online|
#+--------+--------+----------+
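
If bool_and isn't available in your Spark version (it's a Spark 3.0+ SQL function), a sketch of an equivalent check is to cast the boolean to int and compare min/max per group (df is assumed to be the same input DataFrame as above):

from pyspark.sql import functions as F

# Alternative without bool_and: after casting the boolean to int,
# min == 1 means every row is true, max == 0 means every row is false.
df2 = df.groupBy("customer", "PersonId").agg(
    F.when(F.min(F.col("is_online_store").cast("int")) == 1, "Online")
     .when(F.max(F.col("is_online_store").cast("int")) == 0, "Offline")
     .otherwise("Hybrid")
     .alias("New_Column")
)
df2.show()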