Filter dataframe with values repeated based on condition

Let's say I have a dataframe, df_products, like this one:

product_id  version  month
1111        v1       jan
1111        v2       feb
1111        v2       jan
2222        v1       mar
3333        v2       jan
3333        v2       dec
4444        v1       jun
4444        v1       jan
4444        v2       jun

How can I filter it to get only the product_ids that have at least one v1 AND one v2 in the version column? I'd like to get something like this:

product_id
1111
4444
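
For reference, a minimal sketch constructing this frame (the answers below all call it df rather than df_products, so that name is used here):

import pandas as pd

df = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})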

CodePudding user response:

Using set operations: aggregate each product's versions into a set, then keep the product_ids whose set is a superset of {'v1', 'v2'} (for Python sets, >= is the superset test, and pandas applies it element-wise here):

s = df.groupby('product_id')['version'].agg(set)

out = s[s >= {'v1', 'v2'}].index.tolist()

output:

[1111, 4444]
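
For context, the intermediate s maps each product_id to the set of versions it appears with, which is what makes the superset comparison work (set display order may vary):

product_id
1111    {'v1', 'v2'}
2222          {'v1'}
3333          {'v2'}
4444    {'v1', 'v2'}
Name: version, dtype: object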

CodePudding user response:

Here are two approaches:

One uses groupby and filter to remove the groups that don't have both v1 and v2:

(df.groupby('product_id')
   .filter(lambda x: all(v in x['version'].values for v in ['v1', 'v2']))
   ['product_id'].unique().tolist())

Another idea is to use pivot_table and dropna:

(df.pivot_table(index='product_id', columns='version', values='month', aggfunc=set)
   .dropna()
   .index.tolist())
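
Both approaches return [1111, 4444] on the sample data. For the pivot_table route it helps to see the intermediate table: version values become columns, with NaN wherever a product never had that version, so dropna() keeps exactly the products that have both (set display order may vary):

version               v1              v2
product_id
1111             {'jan'}  {'feb', 'jan'}
2222             {'mar'}             NaN
3333                 NaN  {'dec', 'jan'}
4444      {'jan', 'jun'}         {'jun'}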

CodePudding user response:

I think you can create two additional columns, v1 and v2, marking 1 if that version is present on the row and 0 if not:

df['v1'] = 0
df.loc[df['version'] == 'v1', 'v1'] = 1

df['v2'] = 0
df.loc[df['version'] == 'v2', 'v2'] = 1

A single row only ever carries one version, so the two flags have to be combined per product before filtering; taking the per-group maximum with transform does that, and then loc generates the results that you need

flags = df.groupby('product_id')[['v1', 'v2']].transform('max')
result = df.loc[(flags['v1'] == 1) & (flags['v2'] == 1)]

Lastly, if you don't want the helper columns, drop them

df = df.drop(columns=['v1', 'v2'])
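
On the sample data, result holds the six rows for products 1111 and 4444. If only the ids are needed, as in the question, they can be read off the result frame from above with:

result['product_id'].unique().tolist()
# [1111, 4444]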

CodePudding user response:

Here is one way to do it:

# versions that should exist
v = {'v1', 'v2'}

# group on product_id and, with transform, aggregate each group's versions into a set
# keep the rows whose group contains every required version (a subset test, so
# products that also carry other versions would still qualify)
# drop duplicate products and return the product_id

(df.loc[df.groupby('product_id')['version']
        .transform(lambda x: v <= set(x))]
 .drop_duplicates(subset=['product_id'])['product_id'])
0    1111
6    4444
Name: product_id, dtype: int64
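
If a plain list is preferred over the Series above, append .tolist() (same df and v as above):

(df.loc[df.groupby('product_id')['version']
        .transform(lambda x: v <= set(x))]
 .drop_duplicates(subset=['product_id'])['product_id']
 .tolist())
# [1111, 4444]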