Let's say I have a dataframe, df_products, like this one:
| product_id | version | month |
|---|---|---|
| 1111 | v1 | jan |
| 1111 | v2 | feb |
| 1111 | v2 | jan |
| 2222 | v1 | mar |
| 3333 | v2 | jan |
| 3333 | v2 | dec |
| 4444 | v1 | jun |
| 4444 | v1 | jan |
| 4444 | v2 | jun |
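For reproducibility, the same frame in code:

```python
import pandas as pd

# the table above, row by row
df_products = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})
```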
How can I filter it to get only the product_ids that have at least one v1 AND one v2 in the version column? I'd like to get something like this:
| product_id |
|---|
| 1111 |
| 4444 |
CodePudding user response:
Using set operations:
s = df.groupby('product_id')['version'].agg(set)
out = s[s >= {'v1', 'v2'}].index.tolist()
output:
[1111, 4444]
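For anyone who wants to run it end to end, here is the same idea as a self-contained snippet (the sample frame is rebuilt from the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})

# collect each product's versions into a set
s = df.groupby('product_id')['version'].agg(set)
# `>=` on sets is the superset test: keep products whose set contains both versions
out = s[s >= {'v1', 'v2'}].index.tolist()
```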
CodePudding user response:
Here are two approaches:
One uses groupby and filter to remove the groups that don't have both v1 and v2:
df.groupby('product_id').filter(
    lambda x: all(v in x['version'].values for v in ['v1', 'v2'])
)['product_id'].unique().tolist()
Another idea would be to use a pivot table and dropna.
df.pivot_table(index='product_id', columns='version', values='month', aggfunc=set).dropna().index.tolist()
CodePudding user response:
I think you can create two additional columns, v1 and v2, and mark them 1 if that version is present and 0 if not, using
df['v1'] = 0
df.loc[df['version'] == 'v1', 'v1'] = 1
df['v2'] = 0
df.loc[df['version'] == 'v2', 'v2'] = 1
A single row can never have both flags set, so aggregate the flags per product first (the max of a 0/1 column is 1 if the version appears at all), then use loc to pull the products where both are present
flags = df.groupby('product_id')[['v1', 'v2']].max()
result = flags[(flags['v1'] == 1) & (flags['v2'] == 1)].index.tolist()
Lastly, if you don't want the debug columns, drop them
df = df.drop(columns=['v1', 'v2'])
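Put together as a self-contained sketch (per-product aggregation is done with groupby(...).max(), since no single row can carry both flags):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})

# 0/1 indicator columns: 1 where the row carries that version
df['v1'] = (df['version'] == 'v1').astype(int)
df['v2'] = (df['version'] == 'v2').astype(int)

# max within each product is 1 exactly when that version appears at least once
flags = df.groupby('product_id')[['v1', 'v2']].max()
result = flags[(flags['v1'] == 1) & (flags['v2'] == 1)].index.tolist()

# drop the helper columns again
df = df.drop(columns=['v1', 'v2'])
```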
CodePudding user response:
Here is one way to do it:
# versions that should exist
v = {'v1', 'v2'}

# group on product_id and aggregate the versions as a set via transform,
# compare with the valid set of versions,
# drop duplicates from the result,
# and return the product_id
(df.loc[df.groupby('product_id')['version']
          .transform(lambda x: set(x) == v)]
   .drop_duplicates(subset=['product_id'])['product_id'])
0 1111
6 4444
Name: product_id, dtype: int64
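One caveat: set(x) == v keeps only products whose versions are exactly {'v1', 'v2'}. If other versions could appear alongside them, a superset test is safer; a minimal sketch of that variation (sample frame rebuilt from the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})

v = {'v1', 'v2'}

# `v <= set(x)` keeps a product as long as it has both versions,
# even if it also has others (e.g. a hypothetical 'v3')
out = (df.loc[df.groupby('product_id')['version']
                .transform(lambda x: v <= set(x))]
         .drop_duplicates(subset=['product_id'])['product_id']
         .tolist())
```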