Home > Mobile >  How to perform a Window Function operation in a Pandas Dataframe using a cumulative sum?
How to perform a Window Function operation in a Pandas Dataframe using a cumulative sum?

Time:10-28

I have an initial dataframe

df1 = 
 --- --- --- 
|  A|  B|  C|
 --- --- --- 
|  1|  1| 10|
|  1|  2| 11|
|  1|  2| 12|
|  3|  1| 13|
|  2|  1| 14|
|  2|  1| 15|
|  2|  1| 16|
|  4|  1| 17|
|  4|  2| 18|
|  4|  3| 19|
|  4|  4| 19|
|  4|  5| 20|
|  4|  5| 20|
 --- --- --- 

Using pyspark I coded the dataframe with a window function using a sum taking into account the column 'A' and taking into account the column 'B' sorted.

spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D',sum('C').over(spec))

df1.show()

 --- --- --- ----- 
|  A|  B|  C|    D|
 --- --- --- ----- 
|  1|  1| 10| 10.0|
|  1|  2| 11| 33.0|
|  1|  2| 12| 33.0|
|  2|  1| 14| 45.0|
|  2|  1| 15| 45.0|
|  2|  1| 16| 45.0|
|  3|  1| 13| 13.0|
|  4|  1| 17| 17.0|
|  4|  2| 18| 35.0|
|  4|  3| 19| 54.0|
|  4|  4| 19| 73.0|
|  4|  5| 20|113.0|
|  4|  5| 20|113.0|
 --- --- --- ----- 

Is it possible to do the same calculation using Pandas Dataframe?

I tried using

df['D'] = df.sort_values(['A','B']).groupby(['A', 'B'])['C'].transform('cumsum')

but it is not the same result

Thanks

CodePudding user response:

In pandas we can groupby sum on A and B. Then groupby cumsum just over A. To add results back to the DataFrame rename to the new column name then join the results back to the DataFrame on the initial group keys ['A', 'B']:

df1 = df1.join(
    df1.groupby(by=['A', 'B'])['C'].sum()
        .groupby(level='A').cumsum()
        .rename('D'),
    on=['A', 'B']
)

df1:

    A  B   C    D
0   1  1  10   10
1   1  2  11   33
2   1  2  12   33
3   3  1  13   13
4   2  1  14   45
5   2  1  15   45
6   2  1  16   45
7   4  1  17   17
8   4  2  18   35
9   4  3  19   54
10  4  4  19   73
11  4  5  20  113
12  4  5  20  113

Setup:

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
    'B': [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
    'C': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20]
})

CodePudding user response:

You can sort, then cumsum within 'A', then groupby max within ['A', 'B'] groups using transform so you can assign the result back.

df1['D'] = (df1.sort_values(['A', 'B'])
               .groupby('A')['C'].cumsum()
               .groupby([df1['A'], df1['B']]).transform('max'))

    A  B   C    D
0   1  1  10   10
1   1  2  11   33
2   1  2  12   33
3   3  1  13   13
4   2  1  14   45
5   2  1  15   45
6   2  1  16   45
7   4  1  17   17
8   4  2  18   35
9   4  3  19   54
10  4  4  19   73
11  4  5  20  113
12  4  5  20  113
  • Related