For example, I have a simple DataFrame like:
index | data1 | replace_me | agg_me | ID |
---|---|---|---|---|
1 | 100 | ( ) | 25 | 1 |
2 | 200 | (-) | 35 | 2 |
3 | 200 | ( ) | 45 | 2 |
4 | 300 | ( ) | 55 | 3 |
5 | 400 | ( ) | 10 | 4 |
6 | 400 | ( ) | 10 | 4 |
7 | 400 | (-) | 10 | 4 |
8 | 400 | (-) | 10 | 4 |
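For reproducibility, the data can be built roughly like this (assuming pandas, and treating "index" as an ordinary column rather than the DataFrame index):

import pandas as pd

df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5, 6, 7, 8],
    'data1': [100, 200, 200, 300, 400, 400, 400, 400],
    'replace_me': ['( )', '(-)', '( )', '( )', '( )', '( )', '(-)', '(-)'],
    'agg_me': [25, 35, 45, 55, 10, 10, 10, 10],
    'ID': [1, 2, 2, 3, 4, 4, 4, 4],
})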
I am trying to aggregate rows whenever an ID occurs more than once (i.e. the groupby('ID') group size is > 1). For those groups, I am looking to:
- Add the "agg_me" values together
- Replace (-) and ( ) with (=)
- Enter min(agg_me) / sum(agg_me) into a new column called "Percent" (the arithmetic is checked right after the expected output below).
- Only "pair" off rows, i.e. it shouldn't collapse 4 rows into 1. So as a result:
index | data1 | replace_me | agg_me | ID | Percent |
---|---|---|---|---|---|
1 | 100 | ( ) | 25 | 1 | 0 |
2 | 200 | (=) | 80 | 2 | 0.4375 |
4 | 300 | ( ) | 55 | 3 | 0 |
5 | 400 | (=) | 20 | 4 | 0.5 |
6 | 400 | (=) | 20 | 4 | 0.5 |
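To make the Percent arithmetic concrete, a quick check of the numbers above:

# ID 2: rows 2 and 3 pair up
print(35 + 45)                  # agg_me -> 80
print(min(35, 45) / (35 + 45))  # Percent -> 0.4375

# ID 4: four rows collapse into two pairs of (10, 10)
print(10 + 10)                  # agg_me -> 20
print(min(10, 10) / (10 + 10))  # Percent -> 0.5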
Any help is appreciated!
CodePudding user response:
Try this:
import pandas as pd

# True for rows whose ID occurs more than once
vc = df['ID'].map(df['ID'].value_counts()).gt(1)

pd.concat([
    # rows with a unique ID are kept as-is
    df.loc[~vc],
    # rows with a repeated ID are collapsed in pairs:
    # cumcount().floordiv(2) labels rows 0, 0, 1, 1, ... within each ID
    df.loc[vc]
      .groupby(['ID', df.groupby('ID').cumcount().floordiv(2)])
      .agg(index=('index', 'first'),
           data1=('data1', 'first'),
           replace_me=('replace_me', lambda x: '(=)'),
           agg_me=('agg_me', 'sum'),
           Percent=('agg_me', lambda x: x.min() / x.sum()))
      .reset_index(level=0)
]).fillna(0).sort_values('ID').reset_index(drop=True)
Output:
   index  data1 replace_me  agg_me  ID  Percent
0      1    100        ( )      25   1   0.0000
1      2    200        (=)      80   2   0.4375
2      4    300        ( )      55   3   0.0000
3      5    400        (=)      20   4   0.5000
4      7    400        (=)      20   4   0.5000
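The pairing comes from the second grouping key: df.groupby('ID').cumcount() numbers the rows 0, 1, 2, ... within each ID, and .floordiv(2) turns that into 0, 0, 1, 1, ..., so the groupby collapses rows two at a time instead of merging a whole ID into one row. A quick look at that key on the example data (pair_key is just an illustrative name):

pair_key = df.groupby('ID').cumcount().floordiv(2)
print(pair_key.tolist())  # [0, 0, 0, 0, 0, 0, 1, 1] for the example rows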