How can I search for sub-groups of a dataframe that contain specific pairs of data?


I have a dataset that contains orders and the items in those orders. What I want to find is which orders contain a given pair of items together. For example, I want to find the orders that have both 6395477 and 6391546; in this case, orders 20220627-0015 and 20220627-0014 have them.

ORDER_ID     ITEM_ID
20220627-0015            6395477
20220627-0015            6391546
20220627-0015            6385826
20220627-0015            6385822
20220627-0015            6385810
20220627-0015            6385797
20220627-0015            6379808
20220627-0014            6395477
20220627-0014            6391546
20220627-0014            6386347
20220627-0014            6385814
20220627-0014            6385812
20220627-0014            6385810
20220627-0014            6384898
20220627-0013            6395592
20220627-0013            6395583
20220627-0013            6394927
20220627-0013            6386403
20220627-0013            6384293
20220627-0013            6380162
20220627-0013            6379830
20220627-0013            6379828
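
For reference, the sample data above can be rebuilt with a small DataFrame like this (a minimal sketch; the variable name df matches the answers below):

import pandas as pd

# Rebuild the sample data shown above
df = pd.DataFrame({
    'ORDER_ID': ['20220627-0015'] * 7 + ['20220627-0014'] * 7 + ['20220627-0013'] * 8,
    'ITEM_ID': [6395477, 6391546, 6385826, 6385822, 6385810, 6385797, 6379808,
                6395477, 6391546, 6386347, 6385814, 6385812, 6385810, 6384898,
                6395592, 6395583, 6394927, 6386403, 6384293, 6380162, 6379830, 6379828],
})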

CodePudding user response:

To do this, I created a list of the items you want to check for:

ITEM_IDS = [6395477, 6391546]

Then I filtered the dataframe using .loc with .isin() to keep only the rows whose ITEM_ID is one of those two items:

df.loc[df['ITEM_ID'].isin(ITEM_IDS)]

        ORDER_ID  ITEM_ID
0  20220627-0015  6395477
1  20220627-0015  6391546
7  20220627-0014  6395477
8  20220627-0014  6391546

You can then group this by ORDER_ID and use nunique() to count the number of unique ITEM_IDs per order:

orders = df.loc[df['ITEM_ID'].isin(ITEM_IDS)].groupby('ORDER_ID', as_index=False).nunique()

        ORDER_ID  ITEM_ID
0  20220627-0014        2
1  20220627-0015        2

The ORDER_IDs that contain all of the checked items are then the rows where ITEM_ID == len(ITEM_IDS), so the following returns them as a list of ORDER_IDs:

orders.loc[orders['ITEM_ID'] == len(ITEM_IDS), 'ORDER_ID'].to_list()

['20220627-0014', '20220627-0015']
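
Putting those steps together into one self-contained snippet (a sketch assuming df is already loaded as above):

ITEM_IDS = [6395477, 6391546]

# Keep only rows whose ITEM_ID is one of the items being checked for
matches = df.loc[df['ITEM_ID'].isin(ITEM_IDS)]

# Count the distinct matching items per order
counts = matches.groupby('ORDER_ID', as_index=False)['ITEM_ID'].nunique()

# Orders that contain every item in ITEM_IDS
orders = counts.loc[counts['ITEM_ID'] == len(ITEM_IDS), 'ORDER_ID'].to_list()
print(orders)  # ['20220627-0014', '20220627-0015']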

CodePudding user response:

Assuming each item appears at most once per order, you can use:

pair = (6395477, 6391546)
orders = df[df['ITEM_ID'].isin(pair)].value_counts('ORDER_ID').loc[lambda x: x == 2].index
print(orders)

# Output
Index(['20220627-0014', '20220627-0015'], dtype='object', name='ORDER_ID')

You can also use df.where:

orders = (df.where(df['ITEM_ID'].isin(pair)).groupby('ORDER_ID')['ITEM_ID']
            .nunique().loc[lambda x: x == 2].index)
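
If you later need more than two items, the hard-coded 2 can be replaced with len(pair), so the same approach (under the same one-item-per-order assumption) generalizes; a small variation, not from the original answer:

pair = (6395477, 6391546)
orders = (df[df['ITEM_ID'].isin(pair)]
            .value_counts('ORDER_ID')
            .loc[lambda x: x == len(pair)]
            .index
            .tolist())
print(orders)  # ['20220627-0014', '20220627-0015']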

CodePudding user response:

A solution that does not require the matching rows of an order to be consecutive:

v = (6395477, 6391546)
df1 = df[df['ITEM_ID'].isin(v)]
print (df1)
        ORDER_ID  ITEM_ID
0  20220627-0015  6395477
1  20220627-0015  6391546
7  20220627-0014  6395477
8  20220627-0014  6391546

orders = df1.loc[df1['ORDER_ID'].duplicated(keep=False), 'ORDER_ID'].unique().tolist()
print (orders)
['20220627-0015', '20220627-0014']

If you need all pairs in a separate DataFrame:

from itertools import combinations

# Build (order, item-pair) tuples for every 2-item combination within each order
L = [(i, tuple(y)) for i, x in df.groupby('ORDER_ID')['ITEM_ID'] for y in combinations(x, 2)]

df = pd.DataFrame(L, columns=['a','b'])
print (df)
                a                   b
0   20220627-0013  (6395592, 6395583)
1   20220627-0013  (6395592, 6394927)
2   20220627-0013  (6395592, 6386403)
3   20220627-0013  (6395592, 6384293)
4   20220627-0013  (6395592, 6380162)
..            ...                 ...
65  20220627-0015  (6385822, 6385797)
66  20220627-0015  (6385822, 6379808)
67  20220627-0015  (6385810, 6385797)
68  20220627-0015  (6385810, 6379808)
69  20220627-0015  (6385797, 6379808)

[70 rows x 2 columns]
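
To look up a specific pair in that pairs DataFrame, you can compare each tuple as a set so the order of items inside the tuple does not matter (a sketch using the a/b column names above):

target = {6395477, 6391546}

# Rows whose item pair matches the target, regardless of tuple order
mask = df['b'].apply(lambda t: set(t) == target)
orders = df.loc[mask, 'a'].unique().tolist()
print(orders)  # ['20220627-0014', '20220627-0015']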