Get values in one column based on an AND condition in another column in python-CodePudding

I have the below dataframe in python with column names "Order_number" and "item_id" which represents the order number and the items consisting in that order. I need to fetch order numbers that consist of item_ids A AND B so my result should be Order Number 12345 and 84573

Order_number	item_id
12345	A
12345	B
34235	B
43543	B
84573	A
84573	B
45846	A

I have the below code but I am getting an error:

df[df['item_id'] == 'A' & df['item_id'] == 'B'] 

TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]

CodePudding user response：

You can groupby.apply a lambda that checks if the unique "item_id"s include both "A" and "B" for each "Order_number"; then filter the ones that do:

out = df.groupby('Order_number')['item_id'].apply(lambda x: {'A','B'}.issubset(x.unique().tolist())).pipe(lambda x: x.index[x]).tolist()

Another option is to use groupby.any twice; once for "A" and again for "B" to create boolean Series objects that return True if an item_id exists for an Order_number; then since we want both to exist, we use & and filter the "Order_number"s:

out = (df['item_id'].eq('A').groupby(df['Order_number']).any() & df['item_id'].eq('B').groupby(df['Order_number']).any()).pipe(lambda x: x.index[x].tolist())

Output:

[12345, 84573]

CodePudding user response：

I would solve as follows: Group by order number and create a set. Filter set that equal to what we want.


(
  df
   .groupby("Order_number")
   .agg(set)
   .loc[lambda d: d['item_id'].eq({'A','B'})]
   .index
   .values
)

# results 
# array([12345, 84573])

If it’s subset we want, assuming that there can be more than A and B, change the filter .loc to:

…
.loc[lambda d:d['item_id'].map({'A','B'}.issubset)]
…

# results 
# array([12345, 84573])

Oh, and if you want a list instead of array, change .values to tolist().