I'm new to PySpark and have a problem to solve.
I have a dataframe with 4 columns: `customer`, `PersonId`, `is_online_store`, and `count`:
customer | PersonId | is_online_store | count |
---|---|---|---|
afabd2d2 | 4 | true | 1 |
afabd2d2 | 8 | true | 2 |
afabd2d2 | 3 | true | 1 |
afabd2d2 | 2 | false | 1 |
afabd2d2 | 4 | false | 1 |
I need to create a new column according to the following rules:

- If a `PersonId` has one or more `true` values in `is_online_store` and no `false`, then `Online`
- If a `PersonId` has one or more `false` values and no `true`, then `Offline`
- If a `PersonId` has at least one `false` AND at least one `true`, then `Hybrid`
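To make the intended mapping concrete, the rules can be sketched in plain Python (outside Spark); the groups below mirror the sample rows, and the helper name `classify` is just for illustration:

```python
def classify(flags):
    """Map one PersonId's list of is_online_store values to a label."""
    if all(flags):
        return "Online"   # only true values in the group
    if not any(flags):
        return "Offline"  # only false values in the group
    return "Hybrid"       # a mix of true and false

# Groups taken from the sample data: PersonId -> is_online_store values
groups = {4: [True, False], 8: [True], 3: [True], 2: [False]}
labels = {pid: classify(flags) for pid, flags in groups.items()}
# labels == {4: "Hybrid", 8: "Online", 3: "Online", 2: "Offline"}
```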
How it should look:
customer | PersonId | New_Column |
---|---|---|
afabd2d2 | 4 | Hybrid |
afabd2d2 | 8 | Online |
afabd2d2 | 3 | Online |
afabd2d2 | 2 | Offline |
CodePudding user response:
Group by `customer` and `PersonId`, and use a `when` expression to check whether all values in the `is_online_store` column are `true`, all `false`, or a mix of the two, using for example the `bool_and` aggregate function (available since Spark 3.0):
```python
from pyspark.sql import functions as F

df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.expr("bool_and(is_online_store)"), "Online")
    .when(F.expr("bool_and(!is_online_store)"), "Offline")
    .otherwise("Hybrid").alias("New_Column")
)

df1.show()
# +--------+--------+----------+
# |customer|PersonId|New_Column|
# +--------+--------+----------+
# |afabd2d2|       2|   Offline|
# |afabd2d2|       8|    Online|
# |afabd2d2|       4|    Hybrid|
# |afabd2d2|       3|    Online|
# +--------+--------+----------+
```
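As a side note, if you are on a Spark version older than 3.0, the `bool_and` SQL function is not available; an alternative (worth verifying on your version) is to replace `F.expr("bool_and(is_online_store)")` with `F.min("is_online_store")` and `F.expr("bool_and(!is_online_store)")` with `~F.max("is_online_store")`, since for booleans `min` is true only when every value is true and `max` is false only when every value is false. The plain-Python sketch below only demonstrates why that min/max trick is equivalent (Python, like Spark, orders `False < True`):

```python
def classify_minmax(flags):
    """Label a group using min/max instead of bool_and."""
    if min(flags):        # plays the role of bool_and(is_online_store)
        return "Online"
    if not max(flags):    # plays the role of bool_and(!is_online_store)
        return "Offline"
    return "Hybrid"

for flags, expected in [([True], "Online"),
                        ([False], "Offline"),
                        ([True, False], "Hybrid")]:
    assert classify_minmax(flags) == expected
```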