I have the following DataFrame:
| id | target | session | smth |
| ---- | -------- | -------- | ---------------- |
| 1 | 0 | 1 | np.array(1,2,3) |
| 1 | 1 | 1 | np.array(5,7,1) |
| 1 | 0 | 1 | np.array(2,3,4) |
| 1 | 1 | 1 | np.array(3,4,5) |
| 1 | 0 | 2 | np.array(2,2,8) |
| 1 | 0 | 2 | np.array(4,2,0) |
| 1 | 0 | 2 | np.array(0,0,0) |
| 1 | 0 | 2 | np.array(1,3,3) |
| 1 | 1 | 3 | np.array(1,4,4) |
| 1 | 1 | 3 | np.array(1,5,5) |
| 1 | 0 | 3 | np.array(1,6,6) |
| 1 | 0 | 3 | np.array(1,7,7) |
| 1 | 0 | 3 | np.array(1,8,3) |
| 2 | 1 | 1 | np.array(1,9,3) |
I need to aggregate all the np.arrays in the column "smth", grouped by user (id) and session, but only for groups whose target column contains at least two "1" values and at least two "0" values.
UPD: The target values must be kept next to each array; otherwise it is impossible to restore them. For example, for this DataFrame we would get:
For id "1":
[[0, np.array(1,2,3)], [1, np.array(5,7,1)], [0, np.array(2,3,4)], [1, np.array(3,4,5)]]
[[1, np.array(1,4,4)] , [1, np.array(1,5,5)], [0 ,np.array(1,6,6)], [0, np.array(1,7,7)], [0, np.array(1,8,3)]]
This is a very hard aggregation for me. I have tried many groupby approaches without success.
Answer:
One option is to use groupby.apply:
(df.groupby(['id', 'session'])
   .apply(lambda d: list(zip(d['target'], d['smth']))
          if d['target'].value_counts().reindex([0, 1]).ge(2).all()
          else None
          )
   .dropna()
)
Output:
id  session
1   1          [(0, [1, 2, 3]), (1, [5, 7, 1]), (0, [2, 3, 4]...
    3          [(1, [1, 4, 4]), (1, [1, 5, 5]), (0, [1, 6, 6]...
dtype: object
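
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample data and applies the same condition with a plain loop over the groups instead of groupby.apply. It assumes the arrays are meant as np.array([1, 2, 3]) rather than np.array(1,2,3), and the names rows, has_two_of_each and result are only illustrative, not part of the question or the answer above.

```python
import numpy as np
import pandas as pd

# Sample data from the question as (id, target, session, smth) tuples.
# Assumption: np.array(1,2,3) in the table means np.array([1, 2, 3]).
rows = [
    (1, 0, 1, [1, 2, 3]), (1, 1, 1, [5, 7, 1]), (1, 0, 1, [2, 3, 4]),
    (1, 1, 1, [3, 4, 5]), (1, 0, 2, [2, 2, 8]), (1, 0, 2, [4, 2, 0]),
    (1, 0, 2, [0, 0, 0]), (1, 0, 2, [1, 3, 3]), (1, 1, 3, [1, 4, 4]),
    (1, 1, 3, [1, 5, 5]), (1, 0, 3, [1, 6, 6]), (1, 0, 3, [1, 7, 7]),
    (1, 0, 3, [1, 8, 3]), (2, 1, 1, [1, 9, 3]),
]
df = pd.DataFrame({
    "id": [r[0] for r in rows],
    "target": [r[1] for r in rows],
    "session": [r[2] for r in rows],
    "smth": [np.array(r[3]) for r in rows],
})


def has_two_of_each(group):
    """True if the group's target column has at least two 0s and two 1s."""
    counts = group["target"].value_counts()
    return counts.get(0, 0) >= 2 and counts.get(1, 0) >= 2


# Map (id, session) to a list of (target, array) pairs for qualifying groups.
result = {
    key: list(zip(group["target"].tolist(), group["smth"].tolist()))
    for key, group in df.groupby(["id", "session"])
    if has_two_of_each(group)
}

for key, pairs in result.items():
    print(key, pairs)
```

With the sample data, only the groups (1, 1) and (1, 3) pass the check; session 2 of user 1 has no "1" targets and user 2 has a single row. A dict keyed by (id, session) is used here only because it is easy to inspect; the Series produced by groupby.apply carries the same information.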