I have the following dataframe (I didn't include the index here, but obviously there is one):
ID_1 | ID_2 | Count |
---|---|---|
55 | 62 | 1000 |
62 | 55 | 1200 |
... | ... | ... |
Now I would like to aggregate those two columns, since I do not care if the ID is in the column ID_1 or in ID_2.
I would like to get the following result:
ID_1 | ID_2 | Count |
---|---|---|
55 | 62 | 2200 |
62 | 55 | 2200 |
... | ... | ... |
That means I want to sum the Count column over all rows in my dataframe that contain the same two IDs, regardless of whether each ID is in the ID_1 column or the ID_2 column.
I thought about grouping the dataframe, but that did not work properly.
I would be grateful for any help!
CodePudding user response:
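Create virtual groups by turning each ID pair into a sorted tuple, so that (55, 62) and (62, 55) fall into the same group: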
make_group = lambda x: tuple(sorted(x))
df['Count'] = df.groupby(df[['ID_1', 'ID_2']].apply(make_group, axis=1))['Count'] \
.transform('sum')
Output:
>>> df
   ID_1  ID_2  Count
0    55    62   2200
1    62    55   2200
# virtual groups
>>> df[['ID_1', 'ID_2']].apply(make_group, axis=1)
0    (55, 62)
1    (55, 62)
dtype: object
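For reference, here is a minimal end-to-end sketch of the sorted-tuple idea. The sample data is an assumption: the first two rows come from the question, the third row is a hypothetical unrelated pair added to show that other pairs are left alone. The names `pair_key` and `result` are illustrative only.

```python
import pandas as pd

# Sample data: rows 0 and 1 are from the question, row 2 is a made-up extra pair.
df = pd.DataFrame({
    "ID_1": [55, 62, 10],
    "ID_2": [62, 55, 20],
    "Count": [1000, 1200, 5],
})

# Order-independent key: (55, 62) and (62, 55) both become (55, 62).
pair_key = df[["ID_1", "ID_2"]].apply(lambda row: tuple(sorted(row)), axis=1)

# transform("sum") broadcasts each group's total back onto every row of that group.
result = df.assign(Count=df.groupby(pair_key)["Count"].transform("sum"))
print(result)
```

The ID columns keep their original order because only the grouping key is sorted, not the dataframe itself.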
CodePudding user response:
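Sort the ID columns row-wise (this overwrites the original ID order in df):
import numpy as np
df[['ID_1', 'ID_2']] = np.sort(df[['ID_1', 'ID_2']], axis=1)
Then group by the ID columns and sum Count:
df['Count'] = df.groupby(['ID_1', 'ID_2'])['Count'].transform('sum')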
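If the original ID order should be preserved, as in the desired output of the question, the sorted IDs can serve as grouping keys without being written back. The following is a sketch under that assumption, not the answerer's exact method; the sample data (including the hypothetical third row) and the names `sorted_ids`, `totals` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data: rows 0 and 1 are from the question, row 2 is a made-up extra pair.
df = pd.DataFrame({
    "ID_1": [55, 62, 10],
    "ID_2": [62, 55, 20],
    "Count": [1000, 1200, 5],
})

# Row-wise sort of the ID columns only; the result is a plain NumPy array.
sorted_ids = np.sort(df[["ID_1", "ID_2"]].to_numpy(), axis=1)

# Group by the two sorted columns as arrays instead of overwriting df's columns.
totals = df.groupby([sorted_ids[:, 0], sorted_ids[:, 1]])["Count"].transform("sum")

result = df.assign(Count=totals)
print(result)
```

Because `np.sort` is vectorized, this avoids the per-row Python call of an `apply`-based key, which may matter on large frames.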
CodePudding user response:
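Sort the ID values row-wise using np.sort, then group and aggregate. Only the ID columns should be sorted, otherwise Count takes part in the row-wise sort as well. Code below:
import numpy as np
import pandas as pd
ids = pd.DataFrame(np.sort(df[['ID_1', 'ID_2']].to_numpy(), axis=1), columns=['ID_1', 'ID_2'], index=df.index)
df = df.assign(Count=df.groupby([ids['ID_1'], ids['ID_2']])['Count'].transform('sum'))
Output:
   ID_1  ID_2  Count
0    55    62   2200
1    62    55   2200
Alternatively, use agg('sort') to sort the ID columns and then group. This overwrites the ID columns with the sorted values, so row 1 becomes 55, 62. It also relies on pandas resolving the string 'sort' to a row-wise sort, which may depend on the pandas version:
df[df.filter(regex='^ID').columns] = df.filter(regex='^ID').agg('sort')
df['Count'] = df.groupby(['ID_1', 'ID_2'])['Count'].transform('sum')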
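One pitfall with row-wise sorting is worth checking: if the sort covers the Count column too (as sorting `df.values` would), the result is only correct while every Count is larger than both IDs. The sketch below uses hypothetical data with small counts to show the difference; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data where Count is smaller than the IDs.
df = pd.DataFrame({"ID_1": [55, 62], "ID_2": [62, 55], "Count": [3, 7]})

# Sorting all columns row-wise moves Count into the first position,
# so the relabeled frame has Count values sitting in the ID_1 column.
all_sorted = np.sort(df.to_numpy(), axis=1)
mislabeled = pd.DataFrame(all_sorted, columns=df.columns)
wrong = mislabeled.groupby(["ID_1", "ID_2"])["Count"].transform("sum")

# Sorting only the ID columns keeps Count out of the sort.
ids_sorted = np.sort(df[["ID_1", "ID_2"]].to_numpy(), axis=1)
right = df.groupby([ids_sorted[:, 0], ids_sorted[:, 1]])["Count"].transform("sum")

print(wrong.tolist())  # [62, 62]: an ID value ended up being summed as Count
print(right.tolist())  # [10, 10]: the intended total of 3 + 7
```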