How to add new columns based on counts identified by conditions applied to multiple columns in PySpark


Suppose the following is my dataframe:

df

userId  deviceID  Clean_date
ABC123  202030    28-Jul-22
XYZ123  304050    27-Jul-22
ABC123  405032    28-Jul-22
PQR123  385625    22-Jun-22
WER123  465728    2-May-22
XYZ123  935452    22-Mar-22

I want output where a user_id with multiple devices on the same day is flagged 'P1', a user_id with multiple devices on different days is flagged 'P2', and everything else is 'NA'.

The following is a sample output:

df_output

userId  deviceID  Clean_date  Priority
ABC123  202030    28-Jul-22   P1
XYZ123  304050    27-Jul-22   P2
ABC123  405032    28-Jul-22   P1
PQR123  385625    22-Jun-22   NA
WER123  465728    2-May-22    NA
XYZ123  935452    22-Mar-22   P2

Please suggest a solution in PySpark.

CodePudding user response:

You can count the distinct deviceID values per userId and per (userId, Clean_date) using a Window, then compute Priority from those counts with a when expression, like this:

from pyspark.sql import functions as F, Window

df = spark.createDataFrame([
    ("ABC123", 202030, "28-Jul-22"), ("XYZ123", 304050, "27-Jul-22"),
    ("ABC123", 405032, "28-Jul-22"), ("PQR123", 385625, "22-Jun-22"),
    ("WER123", 465728, "02-May-22"), ("XYZ123", 935452, "22-Mar-22")
], ["userId", "deviceID", "Clean_date"])

w = Window.partitionBy("userId")                 # all of a user's rows
w2 = Window.partitionBy("userId", "Clean_date")  # a user's rows on one day

df = df.withColumn(
    "Priority",
    # more than one distinct device on the same day -> P1
    F.when(F.size(F.collect_set("deviceID").over(w2)) > 1, "P1")
    # otherwise, more than one distinct device across all days -> P2
    .when(F.size(F.collect_set("deviceID").over(w)) > 1, "P2")
    .otherwise("NA")
)
df.show()

# +------+--------+----------+--------+
# |userId|deviceID|Clean_date|Priority|
# +------+--------+----------+--------+
# |ABC123|  202030| 28-Jul-22|      P1|
# |ABC123|  405032| 28-Jul-22|      P1|
# |PQR123|  385625| 22-Jun-22|      NA|
# |WER123|  465728| 02-May-22|      NA|
# |XYZ123|  935452| 22-Mar-22|      P2|
# |XYZ123|  304050| 27-Jul-22|      P2|
# +------+--------+----------+--------+
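
collect_set + size is used above because Spark does not support countDistinct as a window function. If you prefer to avoid window functions entirely, here is a minimal alternative sketch that pre-aggregates the distinct-device counts with groupBy and joins them back; the helper column names n_day and n_total are illustrative, not part of the original answer:

from pyspark.sql import functions as F

# distinct devices per user per day
per_day = df.groupBy("userId", "Clean_date").agg(
    F.countDistinct("deviceID").alias("n_day")
)
# distinct devices per user overall
per_user = df.groupBy("userId").agg(
    F.countDistinct("deviceID").alias("n_total")
)

df_output = (
    df.join(per_day, ["userId", "Clean_date"])
      .join(per_user, ["userId"])
      .withColumn(
          "Priority",
          F.when(F.col("n_day") > 1, "P1")
           .when(F.col("n_total") > 1, "P2")
           .otherwise("NA"))
      .drop("n_day", "n_total")
)
df_output.show()

Both versions produce the same Priority column; the join version trades the window shuffle for two aggregations and two joins, which may or may not be faster depending on your data.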