I have a simple df. It has two columns. I want to groupby the values based on column a. Here is a simple example: Any input would be greatly appreciated!
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 1, 2]
df['b'] = [10, 20, 30,40, 50,60]
Desired output is:
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4]
df['b'] = [10, 20, 30,40 ]
df['b1'] = [50, 60, np.nan, np.nan ]
df
CodePudding user response:
Here's a way to do what you want. First you want to group by column 'a'. Normally groupby
is used to calculate group aggregation functions:
df.groupby('a')['b'].mean()
but in this case we want to keep the values of b associated with each a. You can use
[(a,list(b)) for a,b in df.groupby('a')['b']]
[(1, [10, 50]), (2, [20, 60]), (3, [30]), (4, [40])]
Conversion of this to a dataframe almost gets us there:
df2 = pd.DataFrame([(a,list(b)) for a,b in df.groupby('a')['b']],
columns=['a','temp'])
a temp
0 1 [10, 50]
1 2 [20, 60]
2 3 [30]
3 4 [40]
The column temp
can be separated into different columns with to_list
:
pd.DataFrame(df2['temp'].to_list())
0 1
0 10 50.0
1 20 60.0
2 30 NaN
3 40 NaN
Rejoin the output dataframes:
df2.join(df3)
a temp 0 1
0 1 [10, 50] 10 50.0
1 2 [20, 60] 20 60.0
2 3 [30] 30 NaN
3 4 [40] 40 NaN
And clean up (remove temp
column, rename columns and you probably want to do something about the integers cast to floats in the last column due to the NaN
s)
I'm sure there's a cleaner way to do this but hopefully this gets you started!