I have a dataset where the grouping is made based on two columns: code and group. Sample data can be generated as follows:
import pandas as pd
# Sample dataframe
df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})
print(df)
   code group options
0    12     A     x,y
1    12     A       x
2    12     A       x
3    12     B       y
4    12     B       y
5    20     A       z
6    20     A       z
7    20     B       x
8    20     B       y
9    20     B       z
The first thing I do is generate a new column that contains, for each group, all the possible options. I have not been able to do this in a single step, but here is what I do:
# First generate a new column joining all the options by group in temporary strings
df['group_options'] = df.groupby(['code','group'])['options'].transform(lambda x: ','.join(x))
# Transform these temporary strings into lists containing unique values
df['group_options'] = df['group_options'].map(lambda x: list(set(x.split(','))))
Result:
   code group options group_options
0    12     A     x,y        [x, y]
1    12     A       x        [x, y]
2    12     A       x        [x, y]
3    12     B       y           [y]
4    12     B       y           [y]
5    20     A       z           [z]
6    20     A       z           [z]
7    20     B       x     [x, z, y]
8    20     B       y     [x, z, y]
9    20     B       z     [x, z, y]
Now I want to generate, for later use, two new columns, group_a_options and group_b_options. For every row of a given code, these columns should contain the group_options of group A and of group B respectively:
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [x, y]          [x, y]             [y]
1    12     A       x        [x, y]          [x, y]             [y]
2    12     A       x        [x, y]          [x, y]             [y]
3    12     B       y           [y]          [x, y]             [y]
4    12     B       y           [y]          [x, y]             [y]
5    20     A       z           [z]             [z]       [x, y, z]
6    20     A       z           [z]             [z]       [x, y, z]
7    20     B       x     [x, z, y]             [z]       [x, y, z]
8    20     B       y     [x, z, y]             [z]       [x, y, z]
9    20     B       z     [x, z, y]             [z]       [x, y, z]
I have been trying to generate these new columns using groupby and transform with no luck. How can I add a condition on the column group to groupby to obtain the desired output? Any help is appreciated.
CodePudding user response:
First, create a Series by joining each group's values with a comma, splitting on the comma, deduplicating with a set, and converting the result to a list:
s = df.groupby(['code','group'])['options'].agg(lambda x: list(set(','.join(x).split(','))))
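To see what this aggregation produces, here is a minimal sketch that assumes the sample df from the question. Note that sorted() is swapped in for list(set(...)) purely so the printed order is reproducible; that substitution is an assumption of this sketch, not part of the answer above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})

# One list of unique options per (code, group) pair.
# sorted() is used instead of list(set(...)) only to get a stable order.
s = (df.groupby(['code', 'group'])['options']
       .agg(lambda x: sorted(set(','.join(x).split(',')))))

# The result is indexed by a MultiIndex of (code, group)
print(s.index.tolist())
print(s.loc[(12, 'A')])
```

The Series has one entry per (code, group) pair, which is what makes the later unstack and join steps possible.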
Then reshape with Series.unstack and rename the columns:
df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
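As a rough illustration of the reshape, the sketch below rebuilds the aggregated Series from the question's sample data and inspects the wide table. It again uses sorted() in place of list(set(...)) for a reproducible order, which is an assumption of this sketch only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})

# Unique options per (code, group); sorted() only for a stable order
s = (df.groupby(['code', 'group'])['options']
       .agg(lambda x: sorted(set(','.join(x).split(',')))))

# unstack() moves the inner index level (group) into columns A and B,
# leaving one row per code; the columns are then renamed A -> group_a_options
df1 = (s.unstack()
        .add_prefix('group_')
        .add_suffix('_options')
        .rename(columns=str.lower))

print(df1.columns.tolist())
print(df1.index.tolist())
```

Because df1 is indexed by code alone, joining it on the code column copies both group lists onto every row of that code.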
Finally, use DataFrame.join twice: first on both columns to add group_options, then on the code column to add the per-group columns:
df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [y, x]          [y, x]             [y]
1    12     A       x        [y, x]          [y, x]             [y]
2    12     A       x        [y, x]          [y, x]             [y]
3    12     B       y           [y]          [y, x]             [y]
4    12     B       y           [y]          [y, x]             [y]
5    20     A       z           [z]             [z]       [y, x, z]
6    20     A       z           [z]             [z]       [y, x, z]
7    20     B       x     [y, x, z]             [z]       [y, x, z]
8    20     B       y     [y, x, z]             [z]       [y, x, z]
9    20     B       z     [y, x, z]             [z]       [y, x, z]
If ordering is important, deduplicate the values with the dict.fromkeys trick, which preserves first-seen order:
# Run this on the original df, before the columns above were joined
s = (df.groupby(['code','group'])['options']
       .agg(lambda x: list(dict.fromkeys(','.join(x).split(',')))))
df1 = s.unstack().add_prefix('group_').add_suffix('_options').rename(columns=str.lower)
df = df.join(s.rename('group_options'), on=['code','group']).join(df1, on='code')
print(df)
   code group options group_options group_a_options group_b_options
0    12     A     x,y        [x, y]          [x, y]             [y]
1    12     A       x        [x, y]          [x, y]             [y]
2    12     A       x        [x, y]          [x, y]             [y]
3    12     B       y           [y]          [x, y]             [y]
4    12     B       y           [y]          [x, y]             [y]
5    20     A       z           [z]             [z]       [x, y, z]
6    20     A       z           [z]             [z]       [x, y, z]
7    20     B       x     [x, y, z]             [z]       [x, y, z]
8    20     B       y     [x, y, z]             [z]       [x, y, z]
9    20     B       z     [x, y, z]             [z]       [x, y, z]
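For convenience, the steps can be collected into one self-contained sketch. The helper name add_group_option_columns is hypothetical (it does not come from pandas or from the answer), and the sketch assumes the column names code, group and options from the question. It uses the order-preserving dict.fromkeys variant and returns a new DataFrame rather than overwriting df, so it can be rerun without column-name clashes.

```python
import pandas as pd


def add_group_option_columns(df):
    """Hypothetical helper: return a copy of df with group_options plus
    one group_<name>_options column per distinct value of 'group'."""
    # Unique options per (code, group), keeping first-seen order
    s = (df.groupby(['code', 'group'])['options']
           .agg(lambda x: list(dict.fromkeys(','.join(x).split(',')))))
    # One row per code, one column per group value
    wide = (s.unstack()
             .add_prefix('group_')
             .add_suffix('_options')
             .rename(columns=str.lower))
    # Join per (code, group) first, then per code
    return (df.join(s.rename('group_options'), on=['code', 'group'])
              .join(wide, on='code'))


df = pd.DataFrame({'code': [12] * 5 + [20] * 5,
                   'group': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'options': ['x,y', 'x', 'x', 'y', 'y', 'z', 'z', 'x', 'y', 'z']})

out = add_group_option_columns(df)
print(out.columns.tolist())
print(out.loc[7, ['group_options', 'group_a_options', 'group_b_options']].tolist())
```

Since the input df is left untouched, calling the helper a second time on the same df does not raise an overlapping-columns error, which would happen if the joins were repeated on an already extended frame.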