Home > Blockchain >  Unwanted mean in pivot_table when values are identical
Unwanted mean in pivot_table when values are identical

Time:06-16

When there are identical values pivot_table takes the mean (cause aggfunc='mean' by default)

For instance:

d=pd.DataFrame(data={
'x_values':[13.4,13.08,12.73,12.,33.,23.,12.],
'y_values': [1.54, 1.47,1.,2.,4.,4.,3.],
'experiment':['e', 'e', 'e', 'f', 'f','f','f']})
print(pd.pivot_table(d, index='x_values', 
        columns='experiment', values='y_values',sort=False))

returns:

experiment     e    f
x_values             
13.40       1.54  NaN
13.08       1.47  NaN
12.73       1.00  NaN
12.00        NaN  2.5
33.00        NaN  4.0
23.00        NaN  4.0

As you can see a new value in f appears (2.5 which is the mean of 2. and 3).

But I want to keep the list as it was in my pandas

experiment     e    f
x_values             
13.40       1.54  NaN
13.08       1.47  NaN
12.73       1.00  NaN
12.00        NaN  2.0
33.00        NaN  4.0
23.00        NaN  4.0
12.00        NaN  3.0

How can I do it ?
I have tried to play with aggfunc=list followed by an explode but in this case the order is lost ...

Thanks

CodePudding user response:

A workaround would be to select data for each unique experiment value and then concat all these data:

pd.concat([d.loc[d.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c}) for c in d.experiment.unique()])

Result:

   x_values     e    f
0     13.40  1.54  NaN
1     13.08  1.47  NaN
2     12.73  1.00  NaN
3     12.00   NaN  2.0
4     33.00   NaN  4.0
5     23.00   NaN  4.0
6     12.00   NaN  3.0

CodePudding user response:

You could also just assign new variables and fill them according to boolean masks:

df=pd.DataFrame(
    data={
        'x_values':[13.4, 13.08, 12.73, 12., 33., 23., 12.],
        'y_values': [1.54, 1.47,1.,2.,4.,4.,3.],
        'experiment':['e', 'e', 'e', 'f', 'f','f','f']
    }
)

df['e'] = df.loc[df['experiment'] == 'e', 'y_values']
df['f'] = df.loc[df['experiment'] == 'f', 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final


-------------------------------------------------
           e    f
x_values        
13.40     1.54  NaN
13.08     1.47  NaN
12.73     1.00  NaN
12.00     NaN   2.0
33.00     NaN   4.0
23.00     NaN   4.0
12.00     NaN   3.0
-------------------------------------------------

If you have more than one attribute for the column experiment, you can iterate over all unique values:

for experiment in df['experiment'].unique():
    df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final

which results to the desired output.

This approach appears to be more efficient than the approach provided by @Stef. However, with the cost of more lines of code.

from time import time

first_approach = []
for i in range(1000):
    start = time()
    pd.concat([df.loc[df.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c}) for c in df.experiment.unique()]).set_index(['x_values'])
    first_approach.append(time()-start)
    
second_approach = []
for i in range(1000):
    start = time()
    for experiment in df['experiment'].unique():
        df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
    df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
    second_approach.append(time()-start)
    
print(f'Average Time First Approach:\t{sum(first_approach)/len(first_approach):.5f}')
print(f'Average Time Second Approach:\t{sum(second_approach)/len(second_approach):.5f}')

--------------------------------------------
Average Time First Approach:    0.00403
Average Time Second Approach:   0.00205
--------------------------------------------
  • Related