Pandas new column based on condition after groupby

Time:05-30

I have a dataset where the grouping is made based on two columns: code and group. Sample data can be generated as follows:

import pandas as pd
# Sample dataframe
df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                  'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                  'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})
print(df)

   code group options
0    12     A     x,y
1    12     A       x
2    12     A       x
3    12     B       y
4    12     B       y
5    20     A       z
6    20     A       z
7    20     B       x
8    20     B       y
9    20     B       z

The first thing I do is create a new column that holds, for each (code, group) pair, all the options that occur in it. I have not managed to do this in a single step, but here is what I do:

# First generate a new column joining all the options by group in temporary strings
df['group_options'] = df.groupby(['code','group'])['options'].transform(lambda x: ','.join(x))
# Turn each temporary string into a list of unique values (set order is arbitrary)
df['group_options'] = df['group_options'].map(lambda x: list(set(x.split(','))))

Result:

   code group options group_options
0    12     A     x,y        [x, y]
1    12     A       x        [x, y]
2    12     A       x        [x, y]
3    12     B       y           [y]
4    12     B       y           [y]
5    20     A       z           [z]
6    20     A       z           [z]
7    20     B       x     [x, z, y]
8    20     B       y     [x, z, y]
9    20     B       z     [x, z, y]

Now I want to generate, for later use, two new columns, group_a_options and group_b_options. On every row of a given code, they should contain the group_options of group A and of group B for that code:

   code group options group_options group_a_options group_b_options
0    12     A     x,y        [x, y]          [x, y]             [y]
1    12     A       x        [x, y]          [x, y]             [y]
2    12     A       x        [x, y]          [x, y]             [y]
3    12     B       y           [y]          [x, y]             [y]
4    12     B       y           [y]          [x, y]             [y]
5    20     A       z           [z]             [z]       [x, y, z]
6    20     A       z           [z]             [z]       [x, y, z]
7    20     B       x     [x, z, y]             [z]       [x, y, z]
8    20     B       y     [x, z, y]             [z]       [x, y, z]
9    20     B       z     [x, z, y]             [z]       [x, y, z]

I have been trying to generate this new column using groupby and transform with no luck. How can I add a condition for the column group to groupby to obtain the desired output? Any help is appreciated.

Answer:

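First, build a Series that holds the list of unique options for each (code, group) pair: join the strings with a comma, split them again, deduplicate with a set, and convert the result to a list: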

s = df.groupby(['code','group'])['options'].agg(lambda x: list(set(','.join(x).split(','))))
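
To make the lambda easier to follow, here is a minimal sketch of what it does for a single group. The Series below is a hand-copied stand-in for the options of code 12, group A; the intermediate names (joined, tokens, unique) are only for illustration.

```python
import pandas as pd

# The 'options' values for code 12, group A in the sample data
group = pd.Series(['x,y', 'x', 'x'])

joined = ','.join(group)      # one string: 'x,y,x,x'
tokens = joined.split(',')    # individual options: ['x', 'y', 'x', 'x']
unique = list(set(tokens))    # duplicates removed; set order is arbitrary

# sorted() is used here only so the printed order is stable
print(sorted(unique))
```

Because a set has no defined order, the lists in the resulting Series may come out as [x, y] or [y, x].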

Then reshape with Series.unstack and rename the columns:

df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
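
As a rough illustration of the reshape, the sketch below rebuilds the sample data and prints the intermediate frame. It uses sorted(set(...)) rather than list(set(...)); that is a choice made here only so the output is reproducible.

```python
import pandas as pd

df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})

# One list of unique options per (code, group) pair; sorted for a stable order
s = (df.groupby(['code', 'group'])['options']
       .agg(lambda x: sorted(set(','.join(x).split(',')))))

# unstack moves the 'group' index level into columns: one row per code,
# one column per group, then the columns are renamed
df1 = (s.unstack()
        .add_prefix('group_')
        .add_suffix('_options')
        .rename(columns=str.lower))

print(df1.columns.tolist())
print(df1)
```

If some code had no rows for one of the groups, unstack would leave NaN in that cell instead of a list.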

Finally, use DataFrame.join twice: first on both columns (code and group) to add group_options, then on code alone to add the per-group columns:

df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [y, x]          [y, x]             [y]
1    12     A       x        [y, x]          [y, x]             [y]
2    12     A       x        [y, x]          [y, x]             [y]
3    12     B       y           [y]          [y, x]             [y]
4    12     B       y           [y]          [y, x]             [y]
5    20     A       z           [z]             [z]       [y, x, z]
6    20     A       z           [z]             [z]       [y, x, z]
7    20     B       x     [y, x, z]             [z]       [y, x, z]
8    20     B       y     [y, x, z]             [z]       [y, x, z]
9    20     B       z     [y, x, z]             [z]       [y, x, z]
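
The joins above rely on the on parameter of DataFrame.join, which matches columns of the calling frame against the index of the other object. A small sketch with made-up values (left, right and the 'value' column are hypothetical names, not part of the question's data):

```python
import pandas as pd

left = pd.DataFrame({'code': [12, 12, 12, 20],
                     'group': ['A', 'A', 'B', 'A']})

# Indexed by (code, group), like the aggregated Series `s` in the answer
right = pd.Series(
    ['first', 'second', 'third'],
    index=pd.MultiIndex.from_tuples([(12, 'A'), (12, 'B'), (20, 'A')],
                                    names=['code', 'group']),
    name='value',
)

# `on` names columns of `left`; they are matched against the index of `right`,
# so every row of `left` receives the value of its (code, group) pair
out = left.join(right, on=['code', 'group'])
print(out)
```

The Series must have a name (here via name='value', in the answer via rename), since that name becomes the new column.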

If the order matters, deduplicate with dict.fromkeys instead of set, which keeps each value in the order of its first occurrence:

s = (df.groupby(['code','group'])['options']
       .agg(lambda x: list(dict.fromkeys(','.join(x).split(',')))))

df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)

# Start from the original df: the join fails if these columns were already added above
df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [x, y]          [x, y]             [y]
1    12     A       x        [x, y]          [x, y]             [y]
2    12     A       x        [x, y]          [x, y]             [y]
3    12     B       y           [y]          [x, y]             [y]
4    12     B       y           [y]          [x, y]             [y]
5    20     A       z           [z]             [z]       [x, y, z]
6    20     A       z           [z]             [z]       [x, y, z]
7    20     B       x     [x, y, z]             [z]       [x, y, z]
8    20     B       y     [x, y, z]             [z]       [x, y, z]
9    20     B       z     [x, y, z]             [z]       [x, y, z]
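
For reference, a minimal sketch of why dict.fromkeys keeps the order: dictionary keys are unique and, since Python 3.7, dictionaries preserve insertion order. The input string is made up for illustration.

```python
tokens = 'x,y,x,z,y'.split(',')

# Keys are unique and keep the order of first appearance
ordered = list(dict.fromkeys(tokens))
print(ordered)  # ['x', 'y', 'z']
```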