How to count specific rows?


I have an example PySpark dataframe:

X   Y   Z   DATE
23  41  63  2016-01-01
23  41  5   2016-01-01
23  41  75  2016-01-01
23  41  46  2016-12-01
23  41  23  2016-12-01
27  41  5   2016-01-01
27  41  75  2016-01-01
27  41  85  2016-01-01
27  41  71  2016-01-01

What I want is to count the rows that share the same X, Y, and DATE values and store that count in a new column.

The final dataframe should look like this:

X   Y   Z   DATE        SUM
23  41  63  2016-01-01  3
23  41  5   2016-01-01  3
23  41  75  2016-01-01  3
23  41  46  2016-12-01  2
23  41  23  2016-12-01  2
27  41  5   2016-01-01  4
27  41  75  2016-01-01  4
27  41  85  2016-01-01  4
27  41  71  2016-01-01  4

CodePudding user response:

This might help (assuming you mistyped and want a count rather than a sum):

from pyspark.sql.functions import count
from pyspark.sql.window import Window

# Count the rows in each (X, Y, DATE) partition and attach the result to every row
df = df.withColumn("Count", count("*").over(Window.partitionBy("X", "Y", "DATE")))
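For reference, here is a minimal, self-contained sketch that rebuilds the example dataframe from the question and applies the same window count; the new column is named SUM here only to match the desired output, and the SparkSession setup is an assumption of this sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import count
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Rebuild the example dataframe from the question
data = [
    (23, 41, 63, "2016-01-01"),
    (23, 41, 5, "2016-01-01"),
    (23, 41, 75, "2016-01-01"),
    (23, 41, 46, "2016-12-01"),
    (23, 41, 23, "2016-12-01"),
    (27, 41, 5, "2016-01-01"),
    (27, 41, 75, "2016-01-01"),
    (27, 41, 85, "2016-01-01"),
    (27, 41, 71, "2016-01-01"),
]
df = spark.createDataFrame(data, ["X", "Y", "Z", "DATE"])

# Count rows per (X, Y, DATE) group and attach that count to every row in the group
w = Window.partitionBy("X", "Y", "DATE")
df = df.withColumn("SUM", count("*").over(w))

df.show()

Unlike a groupBy aggregation, the window function keeps every original row and simply adds the per-group count alongside it, which is exactly the shape shown in the expected output.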