Home > Blockchain >  How to replicate a Spark window function with collect_list in Pandas dataframes?
How to replicate a Spark window function with collect_list in Pandas dataframes?

Time:11-02

I have an initial dataframe

df1 = 
 --- --- --- 
|  A|  B|  C|
 --- --- --- 
|  1|  1| 10|
|  1|  2| 11|
|  1|  2| 12|
|  3|  1| 13|
|  2|  1| 14|
|  2|  1| 15|
|  2|  1| 16|
|  4|  1| 17|
|  4|  2| 18|
|  4|  3| 19|
|  4|  4| 19|
|  4|  5| 20|
|  4|  5| 20|
 --- --- --- 

Using pyspark I coded the dataframe with a window function using a collect_list function, taking into account the groupping column 'A' and taking into account the column 'B' sorted to create a column with cumulative lists

spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D',collect_list('C').over(spec))

df1.orderBy('A','B').show()

 --- --- --- ------------------------ 
|A  |B  |C  |D                       |
 --- --- --- ------------------------ 
|1  |1  |10 |[10]                    |
|1  |2  |11 |[10, 11, 12]            |
|1  |2  |12 |[10, 11, 12]            |
|2  |1  |14 |[14, 15, 16]            |
|2  |1  |15 |[14, 15, 16]            |
|2  |1  |16 |[14, 15, 16]            |
|3  |1  |13 |[13]                    |
|4  |1  |17 |[17]                    |
|4  |2  |18 |[17, 18]                |
|4  |3  |19 |[17, 18, 19]            |
|4  |4  |19 |[17, 18, 19, 19]        |
|4  |5  |20 |[17, 18, 19, 19, 20, 20]|
|4  |5  |20 |[17, 18, 19, 19, 20, 20]|
 --- --- --- ------------------------ 


Is it possible to do the same calculation using Pandas Dataframe?

I tried using some "normal" python code but probably there is a way to do it more directly.

CodePudding user response:

One way to approach this problem in pandas is to use two groupby's i.e. first group the dataframe on column A then for each group apply the custom defined function collect_list which in turns groups input by column B and cumulatively aggregates the column C using list

def collect_list(g):
    return g.groupby('B')['C'].agg(list).cumsum()

df.sort_values(['A', 'B']).merge(
    df.groupby('A').apply(collect_list).reset_index(name='D'))

    A  B   C                         D
0   1  1  10                      [10]
1   1  2  11              [10, 11, 12]
2   1  2  12              [10, 11, 12]
4   2  1  14              [14, 15, 16]
5   2  1  15              [14, 15, 16]
6   2  1  16              [14, 15, 16]
3   3  1  13                      [13]
7   4  1  17                      [17]
8   4  2  18                  [17, 18]
9   4  3  19              [17, 18, 19]
10  4  4  19          [17, 18, 19, 19]
11  4  5  20  [17, 18, 19, 19, 20, 20]
12  4  5  20  [17, 18, 19, 19, 20, 20]
  • Related