I have a dataset which contains orders, and items in those orders. What I want to find is which item pairs exist together in which group. For example, I want to find orders which has 6395477 and 6391546 together, in this case order 20220627-0015 and 20220627-0014 have them.
ORDER_ID ITEM_ID
20220627-0015 6395477
20220627-0015 6391546
20220627-0015 6385826
20220627-0015 6385822
20220627-0015 6385810
20220627-0015 6385797
20220627-0015 6379808
20220627-0014 6395477
20220627-0014 6391546
20220627-0014 6386347
20220627-0014 6385814
20220627-0014 6385812
20220627-0014 6385810
20220627-0014 6384898
20220627-0013 6395592
20220627-0013 6395583
20220627-0013 6394927
20220627-0013 6386403
20220627-0013 6384293
20220627-0013 6380162
20220627-0013 6379830
20220627-0013 6379828
CodePudding user response:
To do this, I created a list of the items you want to check for:
ITEM_IDS = [6395477, 6391546]
Then I filtered the dataframe using .loc
with .isin()
to filter on the rows that contain one of those 2 products:
df.loc[df['ITEM_ID'].isin(ITEM_IDS)]
ORDER_ID ITEM_ID
0 20220627-0015 6395477
1 20220627-0015 6391546
7 20220627-0014 6395477
8 20220627-0014 6391546
You can then group this up on ORDER_ID
, and use nunique()
to give the count of the number of unique ITEM_IDs
:
orders = df.loc[df['ITEM_ID'].isin(ITEM_IDS)].groupby('ORDER_ID', as_index=False).nunique()
ORDER_ID ITEM_ID
0 20220627-0014 2
1 20220627-0015 2
Then the ORDER_IDs
that contain all of the checked items are the rows where ITEM_ID == len(ITEM_IDS)
, which can be done as below, to return a list of ORDER_IDs
:
orders.loc[orders['ITEM_ID'] == len(ITEM_IDS), 'ORDER_ID'].to_list()
['20220627-0014', '20220627-0015']
CodePudding user response:
Assuming there is only one same item per order, you can use:
pair = (6395477, 6391546)
orders = df[df['ITEM_ID'].isin(pair)].value_counts('ORDER_ID').loc[lambda x: x == 2].index
print(orders)
# Output
Index(['20220627-0014', '20220627-0015'], dtype='object', name='ORDER_ID')
You can also use df.where
:
orders = (df.where(df['ITEM_ID'].isin(pair)).groupby('ORDER_ID')['ITEM_ID']
.nunique().loc[lambda x: x == 2].index)
CodePudding user response:
Solution for not test consecutive orders:
v = (6395477, 6391546)
df1 = df[df['ITEM_ID'].isin(v)]
print (df1)
ORDER_ID ITEM_ID
0 20220627-0015 6395477
1 20220627-0015 6391546
7 20220627-0014 6395477
8 20220627-0014 6391546
orders = df1.loc[df1['ORDER_ID'].duplicated(keep=False), 'ORDER_ID'].unique().tolist()
print (orders)
['20220627-0015', '20220627-0014']
If need all pairs to separately DataFrame:
from itertools import combinations
L = [(i, tuple(y)) for i, x in df.groupby('ORDER_ID')['ITEM_ID'] for y in combinations(x, 2)]
df = pd.DataFrame(L, columns=['a','b'])
print (df)
a b
0 20220627-0013 (6395592, 6395583)
1 20220627-0013 (6395592, 6394927)
2 20220627-0013 (6395592, 6386403)
3 20220627-0013 (6395592, 6384293)
4 20220627-0013 (6395592, 6380162)
.. ... ...
65 20220627-0015 (6385822, 6385797)
66 20220627-0015 (6385822, 6379808)
67 20220627-0015 (6385810, 6385797)
68 20220627-0015 (6385810, 6379808)
69 20220627-0015 (6385797, 6379808)
[70 rows x 2 columns]