PySpark Count Over Window Function


I have a function that is driving me crazy, and I am supposed to use only PySpark.

The table below is a representation of the data:

[input table screenshot]

There are ID, Name, Surname and Validity columns that I can partition by, but I need to add a column with the percentage of emails that are set correctly (marked Valid) for each ID.

Like the image below:

[expected output screenshot]

How can I solve this problem?

This is my attempt so far:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

window = Window.partitionBy("ID", "email", "name", "surname", "validity").orderBy(col("ID").desc())

df = df.withColumn("row_num", row_number().over(window))

# This fails: withColumn expects a Column as its second argument, not a DataFrame
df_new = df.withColumn(
    "total valid emails per ID",
    df.select("validity").where((df.validity == "valid") & (df.row_num == 1)),
).count()
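
For reference, since the screenshots did not come through, an input DataFrame with the columns described might look like this (the values below are hypothetical; the real ones were in the image):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample rows matching the described schema
df = spark.createDataFrame(
    [
        (1, "John", "Doe", "john@mail.com", "Valid"),
        (1, "John", "Doe", "john@mail", "Invalid"),
        (2, "Jane", "Roe", "jane@mail.com", "Valid"),
    ],
    ["ID", "name", "surname", "email", "validity"],
)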

CodePudding user response:

Something like:

from pyspark.sql import Window
from pyspark.sql import functions as F

win = Window.partitionBy("ID", "email", "name", "surname")

df = df.withColumn(
    "pct_valid",
    # count the Valid rows in each window, then divide by the total
    F.sum(F.when(F.col("validity") == "Valid", 1).otherwise(0)).over(win)
    / F.col("total emails"),
)
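
Note that this assumes the DataFrame already has a total emails column. If it does not, a count over a per-ID window can supply it first (a sketch, assuming the percentage is meant per ID as the question describes):

win_id = Window.partitionBy("ID")

# total number of email rows for each ID
df = df.withColumn("total emails", F.count("email").over(win_id))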


CodePudding user response:

This would work:

df.withColumn("ValidAsNumber", F.when(F.col("Validity") == "Valid", 1).otherwise(0))\
  .withColumn("TotalValid", F.sum("ValidAsNumber").over(Window.partitionBy("ID")))\
  .withColumn("PercentValid", F.expr("(TotalValid/TotalEmails)*100")).show()

Input:

[input table screenshot]

Output (I kept the intermediate columns for clarity; you can drop them):

[output screenshot]
