Suppose following is my dataframe
df
| userId | deviceID | Clean_date |
|--------|----------|------------|
| ABC123 | 202030   | 28-Jul-22  |
| XYZ123 | 304050   | 27-Jul-22  |
| ABC123 | 405032   | 28-Jul-22  |
| PQR123 | 385625   | 22-Jun-22  |
| WER123 | 465728   | 2-May-22   |
| XYZ123 | 935452   | 22-Mar-22  |
I want to flag each userId with a Priority: 'P1' if the user has multiple devices on the same day, 'P2' if the user has multiple devices on different days, and 'NA' otherwise. Below is a sample of the expected output:
df_output
| userId | deviceID | Clean_date | Priority |
|--------|----------|------------|----------|
| ABC123 | 202030   | 28-Jul-22  | P1       |
| XYZ123 | 304050   | 27-Jul-22  | P2       |
| ABC123 | 405032   | 28-Jul-22  | P1       |
| PQR123 | 385625   | 22-Jun-22  | NA       |
| WER123 | 465728   | 2-May-22   | NA       |
| XYZ123 | 935452   | 22-Mar-22  | P2       |
Please suggest a solution in PySpark.
CodePudding user response:
You can count the distinct deviceID values per userId and per (userId, Clean_date) using a Window, then compute Priority from those counts with a when expression:
from pyspark.sql import functions as F, Window

df = spark.createDataFrame([
    ("ABC123", 202030, "28-Jul-22"), ("XYZ123", 304050, "27-Jul-22"),
    ("ABC123", 405032, "28-Jul-22"), ("PQR123", 385625, "22-Jun-22"),
    ("WER123", 465728, "02-May-22"), ("XYZ123", 935452, "22-Mar-22"),
], ["userId", "deviceID", "Clean_date"])

# w looks at all rows of a user; w2 looks at a user's rows for a single day
w = Window.partitionBy("userId")
w2 = Window.partitionBy("userId", "Clean_date")

df = df.withColumn(
    "Priority",
    F.when(F.size(F.collect_set("deviceID").over(w2)) > 1, "P1")  # multiple devices on the same day
     .when(F.size(F.collect_set("deviceID").over(w)) > 1, "P2")   # multiple devices on different days
     .otherwise("NA")
)
df.show()
# +------+--------+----------+--------+
# |userId|deviceID|Clean_date|Priority|
# +------+--------+----------+--------+
# |ABC123|  202030| 28-Jul-22|      P1|
# |ABC123|  405032| 28-Jul-22|      P1|
# |PQR123|  385625| 22-Jun-22|      NA|
# |WER123|  465728| 02-May-22|      NA|
# |XYZ123|  935452| 22-Mar-22|      P2|
# |XYZ123|  304050| 27-Jul-22|      P2|
# +------+--------+----------+--------+
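As an alternative that avoids window functions, here is a sketch using groupBy/countDistinct aggregations joined back to the data (it reuses df and F from above; the devices_per_day and devices_per_user column names are just illustrative, not part of the original answer):

# distinct device counts per (userId, Clean_date) and per userId
per_day = df.groupBy("userId", "Clean_date").agg(
    F.countDistinct("deviceID").alias("devices_per_day"))
per_user = df.groupBy("userId").agg(
    F.countDistinct("deviceID").alias("devices_per_user"))

df_output = (
    df.join(per_day, ["userId", "Clean_date"])
      .join(per_user, "userId")
      .withColumn(
          "Priority",
          F.when(F.col("devices_per_day") > 1, "P1")      # multiple devices on the same day
           .when(F.col("devices_per_user") > 1, "P2")     # multiple devices across different days
           .otherwise("NA"))
      .drop("devices_per_day", "devices_per_user")
)
df_output.show()

Both versions give the same Priority labels; the window version keeps everything in a single pass over the rows, while the aggregate-and-join version can be easier to read if you are less familiar with window functions.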