I have a pandas DataFrame with 3 columns:
id product_id is_opt
1 1 False
1 2 False
1 3 True
1 4 True
2 5 False
2 6 False
2 7 False
3 8 False
3 9 False
3 10 True
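For reproducibility, here is a minimal sketch that builds this sample DataFrame (assuming pandas is imported as pd):

import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'product_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'is_opt': [False, False, True, True, False, False, False, False, False, True],
})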
I want to transform this DataFrame this way:
- For a set of rows that shares the same id, if all rows are is_opt = False, then the set of rows stays unchanged. For example, the rows with id = 2 do not change.
- For a set of rows that shares the same id, if at least one row is is_opt = True, then we apply this transformation:
  - All rows that are is_opt = True stay unchanged.
  - All rows that are is_opt = False take at the end of their product_id all the product_ids of the rows that are is_opt = True. If there are n rows with is_opt = True, then 1 row with is_opt = False gives n rows. For example, the first row [1, 1, False] gives 2 rows: [1, 1-3, False] and [1, 1-4, False].
The expected output for the example is:
id product_id
1 1-3
1 1-4
1 2-3
1 2-4
1 3
1 4
2 5
2 6
2 7
3 8-10
3 9-10
3 10
The is_opt column has been dropped in the expected result.
Can you help me with a way to get this result in an efficient set of operations? It is straightforward with some for loops (something like the sketch below), but I would like something efficient because the DataFrames in production are huge.
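For reference, a naive loop-based sketch of the rule described above, using the df built earlier (illustrative only; too slow for production-sized frames):

rows = []
for _, group in df.groupby('id'):
    # product_ids of the is_opt = True rows in this group
    opt_ids = group.loc[group['is_opt'], 'product_id'].tolist()
    for _, row in group.iterrows():
        if row['is_opt'] or not opt_ids:
            # True rows, and groups with no True row, stay unchanged
            rows.append((row['id'], row['product_id']))
        else:
            # Each False row is expanded once per True product_id
            for opt in opt_ids:
                rows.append((row['id'], f"{row['product_id']}-{opt}"))

expected = pd.DataFrame(rows, columns=['id', 'product_id'])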
CodePudding user response:
You can use a custom function and itertools.product:
from itertools import product

import pandas as pd

def combine(df):
    # Group with at least one is_opt = True row: pair each False product_id
    # with every True product_id, then keep the True rows themselves.
    if df['is_opt'].any():
        a = df.loc[~df['is_opt'], 'product_id']
        b = df.loc[df['is_opt'], 'product_id']
        l = ['-'.join(map(str, p)) for p in product(a, b)]
        return pd.Series(l + b.tolist())
    # Group with only is_opt = False rows: keep it unchanged.
    return df['product_id']
out = df.groupby('id').apply(combine).droplevel(1).reset_index(name='product_id')
output:
id product_id
0 1 1-3
1 1 1-4
2 1 2-3
3 1 2-4
4 1 3
5 1 4
6 2 5
7 2 6
8 2 7
9 3 8-10
10 3 9-10
11 3 10
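If groupby.apply turns out to be too slow on very large frames, one possible alternative is to build the pairs with a single merge instead of a per-group Python function. This is only a sketch under the same assumptions as above (it reuses the df built in the question and happens to produce the same row order):

import pandas as pd

opts = df[df['is_opt']]        # rows with is_opt = True
non_opts = df[~df['is_opt']]   # rows with is_opt = False

# Pair every False row with every True row sharing the same id
paired = non_opts.merge(opts[['id', 'product_id']], on='id', suffixes=('', '_opt'))
paired['product_id'] = (paired['product_id'].astype(str)
                        + '-'
                        + paired['product_id_opt'].astype(str))

# False rows whose id has no True row stay unchanged
untouched = non_opts[~non_opts['id'].isin(opts['id'])]

out = pd.concat([paired[['id', 'product_id']],
                 opts[['id', 'product_id']],
                 untouched[['id', 'product_id']]])
out = out.sort_values('id', kind='mergesort').reset_index(drop=True)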