I have the below dataframe in python with column names "Order_number" and "item_id" which represents the order number and the items consisting in that order. I need to fetch order numbers that consist of item_ids A AND B so my result should be Order Number 12345 and 84573
Order_number | item_id |
---|---|
12345 | A |
12345 | B |
34235 | B |
43543 | B |
84573 | A |
84573 | B |
45846 | A |
I have the below code but I am getting an error:
df[df['item_id'] == 'A' & df['item_id'] == 'B']
TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]
CodePudding user response:
You can groupby.apply
a lambda that checks if the unique "item_id"s include both "A" and "B" for each "Order_number"; then filter the ones that do:
out = df.groupby('Order_number')['item_id'].apply(lambda x: {'A','B'}.issubset(x.unique().tolist())).pipe(lambda x: x.index[x]).tolist()
Another option is to use groupby.any
twice; once for "A" and again for "B" to create boolean Series objects that return True if an item_id
exists for an Order_number
; then since we want both to exist, we use &
and filter the "Order_number"s:
out = (df['item_id'].eq('A').groupby(df['Order_number']).any() & df['item_id'].eq('B').groupby(df['Order_number']).any()).pipe(lambda x: x.index[x].tolist())
Output:
[12345, 84573]
CodePudding user response:
I would solve as follows: Group by order number
and create a set. Filter set that equal to what we want.
(
df
.groupby("Order_number")
.agg(set)
.loc[lambda d: d['item_id'].eq({'A','B'})]
.index
.values
)
# results
# array([12345, 84573])
If it’s subset we want, assuming that there can be more than A and B, change the filter .loc
to:
…
.loc[lambda d:d['item_id'].map({'A','B'}.issubset)]
…
# results
# array([12345, 84573])
Oh, and if you want a list instead of array, change .values
to tolist()
.