import pandas as pd
dt = {'order_id': ['A','A','B','B','B','C'], 'XY_ID': [4,5,4,5,6,4]}
print(pd.DataFrame(data=dt))
order_id | XY_ID |
---|---|
A | 4 |
A | 5 |
B | 4 |
B | 5 |
B | 6 |
C | 4 |
and i want by output like below which is group all unique tuples of two from Col2 and count the number of Col1 associated with them
for 4,5 pairs A->4,B->4, A->5,B->5 count is 2, because there is 2 common relation via 4,
for 5,6 pairs A->5, B->5 B->6 count is 1, because there is 1 common relation via 5
XY_ID_Tuple_IDX1 | XY_ID_Tuple_ID2 | order count |
---|---|---|
4 | 5 | 2 |
5 | 6 | 1 |
4 | 6 | 1 |
i tried with df.groupby(['col1', 'col2']).size().reset_index(name='count')
and pivot_table()
CodePudding user response:
Here is an approach using itertools.combinations
:
from itertools import combinations
s = (df.groupby('order_id')['XY_ID']
.agg(lambda x: list(combinations(x, 2)))
.explode()
)
out = s.groupby(s).count()
output:
XY_ID
(4, 5) 2
(4, 6) 1
(5, 6) 1
Name: XY_ID, dtype: int64
Provided format:
# code above
idx = pd.MultiIndex.from_tuples(out.index,
names=['XY_ID_Tuple_IDX1', 'XY_ID_Tuple_IDX2'])
out2 = out.to_frame('order_id').set_axis(idx).reset_index()
output:
XY_ID_Tuple_IDX1 XY_ID_Tuple_IDX2 order_id
0 4 5 2
1 4 6 1
2 5 6 1
CodePudding user response:
The trick is to join the dataframe with itself:
counts = (df.merge(df, on='order_id')
.groupby(['XY_ID_x', 'XY_ID_y'])
.count()
.reset_index()
)
counts.loc[counts['XY_ID_x'] < counts['XY_ID_y']]
counts
content:
XY_ID_x XY_ID_y order_id
1 4 5 2
2 4 6 1
5 5 6 1