Pandas Dataframe Aggregation

I have the following dataframe (I didn't include the index here, but obviously there is one):

ID_1	ID_2	Count
55	62	1000
62	55	1200
...	...	...

Now I would like to aggregate over those two columns, since I do not care whether an ID is in column ID_1 or in column ID_2.

I would like to get the following result:

ID_1	ID_2	Count
55	62	2200
62	55	2200
...	...	...

That means I want to sum the Count column over all rows in my dataframe that contain the same two IDs (regardless of whether they are in the ID_1 column or the ID_2 column).

I thought about grouping the dataframe, but that did not work properly.
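
For reference, a minimal sketch of why a plain groupby falls short here. It assumes the attempt grouped directly on the two ID columns (the question does not show the actual code) and uses only the two rows listed above:

```python
import pandas as pd

# Sample data: the two visible rows from the question.
df = pd.DataFrame({"ID_1": [55, 62], "ID_2": [62, 55], "Count": [1000, 1200]})

# Grouping directly on the columns keeps (55, 62) and (62, 55) apart,
# so each row is summed only with itself.
naive = df.groupby(["ID_1", "ID_2"])["Count"].transform("sum")
print(naive.tolist())  # [1000, 1200]
```

The grouping key has to be independent of column order before the sum can combine both rows.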

I would be grateful for any help!

CodePudding user response：

Create virtual groups by turning each ID pair into a sorted tuple, so that (55, 62) and (62, 55) map to the same key:

make_group = lambda x: tuple(sorted(x))

df['Count'] = df.groupby(df[['ID_1', 'ID_2']].apply(make_group, axis=1))['Count'] \
                .transform('sum')

Output:

>>> df
   ID_1  ID_2  Count
0    55    62   2200
1    62    55   2200

# virtual groups
>>> df[['ID_1', 'ID_2']].apply(make_group, axis=1)
0    (55, 62)
1    (55, 62)
dtype: object
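
The same idea as a self-contained script, in case it helps to run it end to end. This is only a sketch: the first two rows are the ones from the question, the third row is a hypothetical extra pair added to show that unrelated pairs keep their own totals, and keys is a name introduced here for illustration.

```python
import pandas as pd

# Rows 0 and 1 come from the question; row 2 is a hypothetical extra pair.
df = pd.DataFrame(
    {"ID_1": [55, 62, 70], "ID_2": [62, 55, 71], "Count": [1000, 1200, 5]}
)


def make_group(row):
    # Order-independent key: (55, 62) and (62, 55) both become (55, 62).
    return tuple(sorted(row))


# One tuple per row, used only as the grouping key; df's ID columns are untouched.
keys = df[["ID_1", "ID_2"]].apply(make_group, axis=1)

# transform('sum') broadcasts each group's total back to its rows.
df["Count"] = df.groupby(keys)["Count"].transform("sum")
print(df)
```

Because apply with axis=1 calls the function once per row in Python, this is likely to be slower than a vectorized sort on large frames.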

CodePudding user response：

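Sort the ID columns row-wise (note that this overwrites the original ID values, so every pair is stored as smaller ID, larger ID):

import numpy as np

df[['ID_1', 'ID_2']] = np.sort(df[['ID_1', 'ID_2']], axis=1)

Now group by the ID columns and write the group sums back to Count:

df['Count'] = df.groupby(['ID_1', 'ID_2'])['Count'].transform('sum')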
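
If the original order of ID_1 and ID_2 has to be preserved, as in the desired output in the question, the sorted pairs can serve as grouping keys without being written back to the frame. A minimal sketch, assuming the question's two sample rows; keys is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID_1": [55, 62], "ID_2": [62, 55], "Count": [1000, 1200]})

# Row-wise sorted copy of the ID pairs; df itself is not modified.
keys = np.sort(df[["ID_1", "ID_2"]].to_numpy(), axis=1)

# Group by the two sorted key columns (plain arrays are matched by position).
df["Count"] = df.groupby([keys[:, 0], keys[:, 1]])["Count"].transform("sum")
print(df)
```

Passing arrays instead of column names keeps the key separate from the data, so the second row still reads 62, 55 afterwards.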

CodePudding user response：

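Sort the ID values in each row with np.sort, then group by the sorted pair and aggregate. Only the ID columns are sorted here; sorting df.values as a whole would also reorder Count whenever a count is smaller than an ID. Code below:

ids = pd.DataFrame(np.sort(df[['ID_1', 'ID_2']].to_numpy(), axis=1), index=df.index)
df = df.assign(Count=df.groupby([ids[0], ids[1]])['Count'].transform('sum'))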

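Alternatively, use agg('sort') to sort the ID columns and then group by them. This variant overwrites ID_1 and ID_2 with the sorted values; np.sort(df.filter(regex='^ID'), axis=1) is a more explicit way to write the row-wise sort.

df[df.filter(regex='^ID').columns] = df.filter(regex='^ID').agg('sort')
df['Count'] = df.groupby(['ID_1', 'ID_2'])['Count'].transform('sum')

Output of the first variant (the second variant stores 55, 62 in both rows instead):
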
    ID_1  ID_2  Count
0    55    62   2200
1    62    55   2200
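
A caveat worth checking when sorting row-wise with np.sort: if the sort covers every column of the frame (for example np.sort(df.values)), Count takes part in the sort too. With the question's numbers this goes unnoticed because the counts are larger than the IDs. The sketch below uses hypothetical small counts to show the difference; all_sorted and ids_sorted are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same ID pair as in the question, but with small counts.
df = pd.DataFrame({"ID_1": [55, 62], "ID_2": [62, 55], "Count": [10, 20]})

# Sorting every column row-wise moves Count into the first position.
all_sorted = np.sort(df.to_numpy(), axis=1)
print(all_sorted.tolist())  # [[10, 55, 62], [20, 55, 62]]

# Sorting only the ID columns keeps each pair intact.
ids_sorted = np.sort(df[["ID_1", "ID_2"]].to_numpy(), axis=1)
print(ids_sorted.tolist())  # [[55, 62], [55, 62]]
```

Restricting the sort to the ID columns avoids grouping on values that are really counts.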