I have this example pyspark dataframe:
X | Y | Z | DATE |
---|---|---|---|
23 | 41 | 63 | 2016-01-01 |
23 | 41 | 5 | 2016-01-01 |
23 | 41 | 75 | 2016-01-01 |
23 | 41 | 46 | 2016-12-01 |
23 | 41 | 23 | 2016-12-01 |
27 | 41 | 5 | 2016-01-01 |
27 | 41 | 75 | 2016-01-01 |
27 | 41 | 85 | 2016-01-01 |
27 | 41 | 71 | 2016-01-01 |
What I want is to count the rows that share the same X, Y and DATE values and store that count in a new column. The final dataframe should look like this:
X | Y | Z | DATE | SUM |
---|---|---|---|---|
23 | 41 | 63 | 2016-01-01 | 3 |
23 | 41 | 5 | 2016-01-01 | 3 |
23 | 41 | 75 | 2016-01-01 | 3 |
23 | 41 | 46 | 2016-12-01 | 2 |
23 | 41 | 23 | 2016-12-01 | 2 |
27 | 41 | 5 | 2016-01-01 | 4 |
27 | 41 | 75 | 2016-01-01 | 4 |
27 | 41 | 85 | 2016-01-01 | 4 |
27 | 41 | 71 | 2016-01-01 | 4 |
CodePudding user response:
This might help (assuming you mistyped and wanted a count instead of a sum). Note that `Window` needs to be imported as well:

from pyspark.sql.functions import count
from pyspark.sql import Window

df = df.withColumn("SUM", count("*").over(Window.partitionBy("X", "Y", "DATE")))
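To make the semantics of `count("*").over(Window.partitionBy(...))` concrete without needing a Spark session, here is a plain-Python sketch of the same idea: every row receives the size of its (X, Y, DATE) group. The `rows` list is just the question's table retyped as tuples.

```python
from collections import Counter

# The question's dataframe as plain tuples: (X, Y, Z, DATE).
rows = [
    (23, 41, 63, "2016-01-01"),
    (23, 41, 5,  "2016-01-01"),
    (23, 41, 75, "2016-01-01"),
    (23, 41, 46, "2016-12-01"),
    (23, 41, 23, "2016-12-01"),
    (27, 41, 5,  "2016-01-01"),
    (27, 41, 75, "2016-01-01"),
    (27, 41, 85, "2016-01-01"),
    (27, 41, 71, "2016-01-01"),
]

# Size of each (X, Y, DATE) partition -- what the window count computes.
group_sizes = Counter((x, y, d) for x, y, z, d in rows)

# Attach the partition size to every row, like withColumn does.
result = [(x, y, z, d, group_sizes[(x, y, d)]) for x, y, z, d in rows]

print(result[0])   # (23, 41, 63, '2016-01-01', 3)
print(result[-1])  # (27, 41, 71, '2016-01-01', 4)
```

Unlike `groupBy(...).count()`, which collapses each group to a single row, the window version keeps all nine rows and only adds the count, matching the expected output table.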