Home > OS >  Python pandas - After groupby, how to create new columns based on values in other columns
Python pandas - After groupby, how to create new columns based on values in other columns

Time:06-11

I have a df of the below format. Here the 'Count' column is the count of 'Col3'. Considering the first two rows, it has 2 counts of L and 1 count of W.

input_data = {
'Col1' : ['A','A','A','A','B','B','C'],
'Col2' : ['D','D','T','T','D','D','T'],
'Col3' : ['L','W','L','W','W','L','W'],
'Count': [2,1,3,2,3,2,2]
}
input_df = pd.DataFrame(input_data)
print(input_df)

enter image description here

I want to convert this df into the below format -required_df

output_data = {
'Col1' : ['A','A','B','C'],
'Col2' : ['D','T','D','T'],
'New_Col3' : ['L/W','L/W','L/W','W'],
'W_Count' : [1,3,3,2],
'L_Count' : [2,2,2,0]

}

enter image description here

i.e, the first 2 rows of the first df is converted into the first row of the required_df. For each unique set of ['Col1','Col2'], values of 'Col3' is joined with '/' and the count is added as 2 new columns W-Count and L_Count. Variation in last row, where there is only W value.

I know we need to do groupby(['Col1','Col2']) but unable to think after that to get the values of other two columns as reflected in required_df. What would be the best way to achieve this?

As the real data is sensitive, I cannot share it here. But the original data is large with lakhs of rows.

CodePudding user response:

You can combine a groupby.agg and pivot_table:

(df
 .groupby(['col1', 'col2'])
 .agg(**{'New_col3': ('col3', lambda x: '/'.join(sorted(x)))})
 .join(df.pivot_table(index=['col1', 'col2'],
                      columns='col3',
                      values='col4',
                      fill_value=0)
       .add_suffix('_count')
      )
 .reset_index()
 )

Output:

  col1 col2 New_col3  L_count  W_count
0    A    D      L/W        2        1
1    A    T      L/W        3        2
2    B    D      L/W        2        3
3    C    T        W        0        2

Used input:

df = pd.DataFrame({'col1': list('AAAABBC'),
                   'col2': list('DDTTDDT'),
                   'col3': list('LWLWWLW'),
                   'col4': (2,1,3,2,3,2,2)})
  • Related