I have an initial dataframe
df1 =
--- --- ---
| A| B| C|
--- --- ---
| 1| 1| 10|
| 1| 2| 11|
| 1| 2| 12|
| 3| 1| 13|
| 2| 1| 14|
| 2| 1| 15|
| 2| 1| 16|
| 4| 1| 17|
| 4| 2| 18|
| 4| 3| 19|
| 4| 4| 19|
| 4| 5| 20|
| 4| 5| 20|
--- --- ---
Using pyspark I coded the dataframe with a window function using a collect_list
function, taking into account the groupping column 'A'
and taking into account the column 'B'
sorted to create a column with cumulative lists
spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D',collect_list('C').over(spec))
df1.orderBy('A','B').show()
--- --- --- ------------------------
|A |B |C |D |
--- --- --- ------------------------
|1 |1 |10 |[10] |
|1 |2 |11 |[10, 11, 12] |
|1 |2 |12 |[10, 11, 12] |
|2 |1 |14 |[14, 15, 16] |
|2 |1 |15 |[14, 15, 16] |
|2 |1 |16 |[14, 15, 16] |
|3 |1 |13 |[13] |
|4 |1 |17 |[17] |
|4 |2 |18 |[17, 18] |
|4 |3 |19 |[17, 18, 19] |
|4 |4 |19 |[17, 18, 19, 19] |
|4 |5 |20 |[17, 18, 19, 19, 20, 20]|
|4 |5 |20 |[17, 18, 19, 19, 20, 20]|
--- --- --- ------------------------
Is it possible to do the same calculation using Pandas Dataframe?
I tried using some "normal" python code but probably there is a way to do it more directly.
CodePudding user response:
One way to approach this problem in pandas is to use two groupby's i.e. first group the dataframe on column A
then for each group apply the custom defined function collect_list
which in turns groups input by column B
and cumulatively aggregates the column C
using list
def collect_list(g):
return g.groupby('B')['C'].agg(list).cumsum()
df.sort_values(['A', 'B']).merge(
df.groupby('A').apply(collect_list).reset_index(name='D'))
A B C D
0 1 1 10 [10]
1 1 2 11 [10, 11, 12]
2 1 2 12 [10, 11, 12]
4 2 1 14 [14, 15, 16]
5 2 1 15 [14, 15, 16]
6 2 1 16 [14, 15, 16]
3 3 1 13 [13]
7 4 1 17 [17]
8 4 2 18 [17, 18]
9 4 3 19 [17, 18, 19]
10 4 4 19 [17, 18, 19, 19]
11 4 5 20 [17, 18, 19, 19, 20, 20]
12 4 5 20 [17, 18, 19, 19, 20, 20]