When there are identical values pivot_table takes the mean
(cause aggfunc='mean'
by default)
For instance:
d=pd.DataFrame(data={
'x_values':[13.4,13.08,12.73,12.,33.,23.,12.],
'y_values': [1.54, 1.47,1.,2.,4.,4.,3.],
'experiment':['e', 'e', 'e', 'f', 'f','f','f']})
print(pd.pivot_table(d, index='x_values',
columns='experiment', values='y_values',sort=False))
returns:
experiment e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.5
33.00 NaN 4.0
23.00 NaN 4.0
As you can see a new value in f appears (2.5 which is the mean of 2. and 3).
But I want to keep the list as it was in my pandas
experiment e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.0
33.00 NaN 4.0
23.00 NaN 4.0
12.00 NaN 3.0
How can I do it ?
I have tried to play with aggfunc=list
followed by an explode
but in this case the order is lost ...
Thanks
CodePudding user response:
A workaround would be to select data for each unique experiment value and then concat all these data:
pd.concat([d.loc[d.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c}) for c in d.experiment.unique()])
Result:
x_values e f
0 13.40 1.54 NaN
1 13.08 1.47 NaN
2 12.73 1.00 NaN
3 12.00 NaN 2.0
4 33.00 NaN 4.0
5 23.00 NaN 4.0
6 12.00 NaN 3.0
CodePudding user response:
You could also just assign new variables and fill them according to boolean masks:
df=pd.DataFrame(
data={
'x_values':[13.4, 13.08, 12.73, 12., 33., 23., 12.],
'y_values': [1.54, 1.47,1.,2.,4.,4.,3.],
'experiment':['e', 'e', 'e', 'f', 'f','f','f']
}
)
df['e'] = df.loc[df['experiment'] == 'e', 'y_values']
df['f'] = df.loc[df['experiment'] == 'f', 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final
-------------------------------------------------
e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.0
33.00 NaN 4.0
23.00 NaN 4.0
12.00 NaN 3.0
-------------------------------------------------
If you have more than one attribute for the column experiment
, you can iterate over all unique values:
for experiment in df['experiment'].unique():
df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final
which results to the desired output.
This approach appears to be more efficient than the approach provided by @Stef. However, with the cost of more lines of code.
from time import time
first_approach = []
for i in range(1000):
start = time()
pd.concat([df.loc[df.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c}) for c in df.experiment.unique()]).set_index(['x_values'])
first_approach.append(time()-start)
second_approach = []
for i in range(1000):
start = time()
for experiment in df['experiment'].unique():
df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
second_approach.append(time()-start)
print(f'Average Time First Approach:\t{sum(first_approach)/len(first_approach):.5f}')
print(f'Average Time Second Approach:\t{sum(second_approach)/len(second_approach):.5f}')
--------------------------------------------
Average Time First Approach: 0.00403
Average Time Second Approach: 0.00205
--------------------------------------------