Let's say I have a dataframe, df_products, like this one:
| product_id | version | month |
|---|---|---|
| 1111 | v1 | jan |
| 1111 | v2 | feb |
| 1111 | v2 | jan |
| 2222 | v1 | mar |
| 3333 | v2 | jan |
| 3333 | v2 | dec |
| 4444 | v1 | jun |
| 4444 | v1 | jan |
| 4444 | v2 | jun |
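For reproducibility, the same frame in code:

```python
import pandas as pd

# the table above, row by row
df_products = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})
```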
How can I filter it to get only the product_ids that have at least one v1 AND one v2 in the version column? I'd like to get something like this:
| product_id |
|---|
| 1111 |
| 4444 |
CodePudding user response:
Using set operations:
s = df.groupby('product_id')['version'].agg(set)
out = s[s >= {'v1', 'v2'}].index.tolist()
output:
[1111, 4444]
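For anyone who wants to run it end to end, here is the same idea as a self-contained snippet (the sample frame is rebuilt from the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})

# collect each product's versions into a set
s = df.groupby('product_id')['version'].agg(set)
# `>=` on sets is the superset test: keep products whose set contains both versions
out = s[s >= {'v1', 'v2'}].index.tolist()
```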
CodePudding user response:
Here are two approaches:
One uses groupby and filter to remove the groups that don't have both v1 and v2:
df.groupby('product_id').filter(
    lambda x: all(v in x['version'].values for v in ['v1', 'v2'])
)['product_id'].unique().tolist()
Another idea would be to use a pivot table and dropna.
df.pivot_table(index='product_id', columns='version', values='month', aggfunc=set).dropna().index.tolist()
CodePudding user response:
I think you can create two additional columns, v1 and v2, and mark them 1 if that version is present and 0 if not, using
df['v1'] = 0
df.loc[df['version'] == 'v1', 'v1'] = 1
df['v2'] = 0
df.loc[df['version'] == 'v2', 'v2'] = 1
A single row can never have both flags set, so aggregate the flags per product first (the max of a 0/1 column is 1 if the version appears at all), then use loc to pull the products where both are present
flags = df.groupby('product_id')[['v1', 'v2']].max()
result = flags[(flags['v1'] == 1) & (flags['v2'] == 1)].index.tolist()
Lastly, if you don't want the debug columns, drop them
df = df.drop(columns=['v1', 'v2'])
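Put together as a self-contained sketch (per-product aggregation is done with groupby(...).max(), since no single row can carry both flags):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})

# 0/1 indicator columns: 1 where the row carries that version
df['v1'] = (df['version'] == 'v1').astype(int)
df['v2'] = (df['version'] == 'v2').astype(int)

# max within each product is 1 exactly when that version appears at least once
flags = df.groupby('product_id')[['v1', 'v2']].max()
result = flags[(flags['v1'] == 1) & (flags['v2'] == 1)].index.tolist()

# drop the helper columns again
df = df.drop(columns=['v1', 'v2'])
```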
CodePudding user response:
Here is one way to do it:
# versions that should exist
v = {'v1', 'v2'}

# group on product_id and aggregate the versions as a set via transform,
# compare with the valid set of versions,
# drop duplicates from the result,
# and return the product_id
(df.loc[df.groupby('product_id')['version']
          .transform(lambda x: set(x) == v)]
   .drop_duplicates(subset=['product_id'])['product_id'])
0 1111
6 4444
Name: product_id, dtype: int64
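One caveat: set(x) == v keeps only products whose versions are exactly {'v1', 'v2'}. If other versions could appear alongside them, a superset test is safer; a minimal sketch of that variation (sample frame rebuilt from the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'product_id': [1111, 1111, 1111, 2222, 3333, 3333, 4444, 4444, 4444],
    'version':    ['v1', 'v2', 'v2', 'v1', 'v2', 'v2', 'v1', 'v1', 'v2'],
    'month':      ['jan', 'feb', 'jan', 'mar', 'jan', 'dec', 'jun', 'jan', 'jun'],
})

v = {'v1', 'v2'}

# `v <= set(x)` keeps a product as long as it has both versions,
# even if it also has others (e.g. a hypothetical 'v3')
out = (df.loc[df.groupby('product_id')['version']
                .transform(lambda x: v <= set(x))]
         .drop_duplicates(subset=['product_id'])['product_id']
         .tolist())
```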