Calculate %CV for 2 pandas dataframes (or just standard deviation and mean)-CodePudding

I have two pandas dataframes that are the same length and width. I want to find the %CV between each cell of the two dataframes (or the standard dev and mean, and then I can calculate CV by hand).

For example, if I have something like this:

and

I want to return one dataframe containing the standard deviation/mean/CV between A1 in the first dataframe and A1 in the second dataframe.

Example (mean):

I want the same for standard deviation and then to calculate %CV via standard deviation / mean.

I have tried converting the dataframes to numpy

stan_dev = np.dstack((arr1.to_numpy(), arr2.to_numpy())).std(axis=2)

but I get this error:

TypeError: loop of ufunc does not support argument 0 of type float which has no callable sqrt method

Thank you!

CodePudding user response：

You can do this with either pandas or numpy. However, I am cautious about what you hope to infer based on the standard deviation/coefficient of variation based on 2 data points.

With that warning out of the way, you'll note that the 2 answers below have a similar structure-

Combine datasets into a common container
Perform aggregation computations
Operate on those aggregations to calculate coefficient of variation

NumPy approach

stack to combine datasets
aggregate with axis=0 allows to aggregate across the lowest dimension
calculate coef. of variation
reassemble results into a DataFrame

import numpy as np
import pandas as pd

arr = np.stack((df1, df2))
cv_arr = arr.std(axis=0, ddof=1) / arr.mean(axis=0)
out = pd.DataFrame(data=cv_arr, columns=df1.columns, index=df1.index)

print(out)
          A    B         C
0  0.707107  0.0  0.707107
1  0.282843  0.0  0.282843
2  0.176777  0.0  0.176777

Pure pandas approach

pd.concat to combine datasets and specify the keys paramter to obtain use a MultiIndex to represent data higher than 2d
groupby to operate across the groups
.swaplevels to better leverage index alignment
use .pipe to calculate coef of variation

import pandas as pd

out = (
    pd.concat([df1, df2], keys=[0, 1])
    .groupby(level=1)
    .agg(['mean', 'std'])
    .swaplevel(axis=1)
    .pipe(lambda df_: df_['std'] / df_['mean'])
)

print(out)
          A    B         C
0  0.707107  0.0  0.707107
1  0.282843  0.0  0.282843
2  0.176777  0.0  0.176777

I'd personally recommend the numpy solution because these data & analysis are better represented by a 3d array than a hierarchical DataFrame.