Recombine string columns based on another column in pandas


I have a pandas DataFrame with 3 columns:

id  product_id  is_opt
1   1           False
1   2           False
1   3           True
1   4           True
2   5           False
2   6           False
2   7           False
3   8           False
3   9           False
3   10          True

I want to transform this DataFrame as follows:

  • For a set of rows that share the same id, if every row has is_opt = False, the set of rows stays unchanged. For example, the rows with id = 2 do not change.
  • For a set of rows that share the same id, if at least one row has is_opt = True, we apply this transformation:
    • All rows with is_opt = True stay unchanged.
    • Each row with is_opt = False gets the product_id of every is_opt = True row in the same set appended to its own product_id, separated by a hyphen. If there are n rows with is_opt = True, then 1 row with is_opt = False gives n rows. For example, the first row [1, 1, False] gives 2 rows: [1, 1-3, False] and [1, 1-4, False].
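The rules above can be sketched as a plain loop over (id, product_id, is_opt) tuples. This is only a reference for the intended behaviour, not an efficient solution; the function name `recombine` and the tuple layout are assumptions made for illustration.

```python
# Example rows as (id, product_id, is_opt) tuples, copied from the table above.
rows = [
    (1, 1, False), (1, 2, False), (1, 3, True), (1, 4, True),
    (2, 5, False), (2, 6, False), (2, 7, False),
    (3, 8, False), (3, 9, False), (3, 10, True),
]

def recombine(rows):
    # First pass: collect the optional product_ids of each id, in order.
    opts = {}
    for id_, product_id, is_opt in rows:
        if is_opt:
            opts.setdefault(id_, []).append(product_id)

    # Second pass: expand non-optional rows, keep everything else as is.
    result = []
    for id_, product_id, is_opt in rows:
        group_opts = opts.get(id_, [])
        if is_opt or not group_opts:
            # Optional rows, and rows of ids without optional rows, are unchanged.
            result.append((id_, str(product_id)))
        else:
            # One output row per optional product_id of the same id.
            for opt_id in group_opts:
                result.append((id_, f"{product_id}-{opt_id}"))
    return result

result = recombine(rows)
```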

The expected output for the example is:

id  product_id
1   1-3
1   1-4
1   2-3
1   2-4
1   3
1   4
2   5
2   6
2   7
3   8-10
3   9-10
3   10

The is_opt column has been dropped in the expected result.

Can you help me find a way to get this result with an efficient set of operations? It is straightforward with a few for loops, but I would like something efficient because the DataFrames in production are huge.

Answer:

You can use a custom function and itertools.product:

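import pandas as pd
from itertools import product

def combine(df):
    # Only groups with at least one optional row need to be expanded
    if df['is_opt'].any():
        a = df.loc[~df['is_opt'], 'product_id']  # non-optional product_ids
        b = df.loc[df['is_opt'], 'product_id']   # optional product_ids
        # Pair every non-optional id with every optional id, e.g. "1-3"
        l = ['-'.join(map(str, p)) for p in product(a, b)]
        # Combined rows first, then the optional rows unchanged
        return pd.Series(l + b.tolist())
    # No optional rows: the group stays as it is
    return df['product_id']

out = df.groupby('id').apply(combine).droplevel(1).reset_index(name='product_id')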


    id product_id
0    1        1-3
1    1        1-4
2    1        2-3
3    1        2-4
4    1          3
5    1          4
6    2          5
7    2          6
8    2          7
9    3       8-10
10   3       9-10
11   3         10
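
Since the question mentions huge DataFrames, one possible alternative (not part of the answer above) is to replace the per-group Python function with a left merge of the non-optional rows against the optional rows of the same id. The sketch below assumes the three columns from the question; the function name `combine_vectorized` and the helper columns `pos` and `pos_opt` are made up for illustration and only serve to restore the row order of the expected output.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "product_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "is_opt": [False, False, True, True, False, False, False, False, False, True],
})

def combine_vectorized(df):
    s = df.reset_index(drop=True)
    # Strings are needed for the "a-b" concatenation; pos remembers row order.
    s = s.assign(product_id=s["product_id"].astype(str), pos=range(len(s)))
    opt = s.loc[s["is_opt"], ["id", "product_id", "pos"]]
    base = s.loc[~s["is_opt"], ["id", "product_id", "pos"]]

    # Left merge: each non-optional row is repeated once per optional row of
    # the same id; ids without optional rows keep one row with NaN on the right.
    m = base.merge(opt, on="id", how="left", suffixes=("", "_opt"))
    combined = m["product_id"] + "-" + m["product_id_opt"]
    # Where there was no optional partner, fall back to the original value.
    m["product_id"] = combined.fillna(m["product_id"])

    # Put the untouched optional rows back and order each id as:
    # expanded non-optional rows first, then optional rows.
    out = pd.concat([m.assign(is_opt=False), opt.assign(is_opt=True)],
                    ignore_index=True)
    out = out.sort_values(["id", "is_opt", "pos", "pos_opt"])
    return out[["id", "product_id"]].reset_index(drop=True)

out = combine_vectorized(df)
```

On the example data this should give the same twelve rows as the expected output. Whether it is actually faster than groupby.apply depends on the number of groups and on how many optional rows each id has, so it is worth timing both on realistic data; the size of the result itself grows with the number of (non-optional, optional) pairs in either approach.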