Count and Sum of distinct values of the tuple pairs in python-CodePudding

import pandas as pd
dt = {'order_id': ['A','A','B','B','B','C'], 'XY_ID': [4,5,4,5,6,4]}
print(pd.DataFrame(data=dt))

order_id	XY_ID
A	4
A	5
B	4
B	5
B	6
C	4

and i want by output like below which is group all unique tuples of two from Col2 and count the number of Col1 associated with them

for 4,5 pairs A->4,B->4, A->5,B->5 count is 2, because there is 2 common relation via 4,
for 5,6 pairs A->5, B->5 B->6 count is 1, because there is 1 common relation via 5

XY_ID_Tuple_IDX1	XY_ID_Tuple_ID2	order count
4	5	2
5	6	1
4	6	1

i tried with df.groupby(['col1', 'col2']).size().reset_index(name='count') and pivot_table()

CodePudding user response：

Here is an approach using itertools.combinations:

from itertools import combinations

s = (df.groupby('order_id')['XY_ID']
       .agg(lambda x: list(combinations(x, 2)))
       .explode()
     )

out = s.groupby(s).count()

output:

XY_ID
(4, 5)    2
(4, 6)    1
(5, 6)    1
Name: XY_ID, dtype: int64

Provided format:

# code above  

idx = pd.MultiIndex.from_tuples(out.index,
                                names=['XY_ID_Tuple_IDX1', 'XY_ID_Tuple_IDX2'])
out2 = out.to_frame('order_id').set_axis(idx).reset_index()

output:

   XY_ID_Tuple_IDX1  XY_ID_Tuple_IDX2  order_id
0                 4                 5         2
1                 4                 6         1
2                 5                 6         1

CodePudding user response：

The trick is to join the dataframe with itself:

counts = (df.merge(df, on='order_id')
            .groupby(['XY_ID_x', 'XY_ID_y'])
            .count()
            .reset_index()
          )
counts.loc[counts['XY_ID_x'] < counts['XY_ID_y']]

counts content:

   XY_ID_x  XY_ID_y  order_id
1        4        5         2
2        4        6         1
5        5        6         1