Home > Back-end >  Calculate %CV for 2 pandas dataframes (or just standard deviation and mean)
Calculate %CV for 2 pandas dataframes (or just standard deviation and mean)

Time:06-21

I have two pandas dataframes that are the same length and width. I want to find the %CV between each cell of the two dataframes (or the standard dev and mean, and then I can calculate CV by hand).

For example, if I have something like this:

   A B C
0  1 2 3
1  4 5 6 
2  7 8 9 

and

   A B C
0  3 2 1
1  6 5 4
2  9 8 7

I want to return one dataframe containing the standard deviation/mean/CV between A1 in the first dataframe and A1 in the second dataframe.

Example (mean):

   A B C
0  2 2 2 
1  5 5 5 
2  8 8 8

I want the same for standard deviation and then to calculate %CV via standard deviation / mean.

I have tried converting the dataframes to numpy

stan_dev = np.dstack((arr1.to_numpy(), arr2.to_numpy())).std(axis=2)

but I get this error:

TypeError: loop of ufunc does not support argument 0 of type float which has no callable sqrt method

Thank you!

CodePudding user response:

You can do this with either pandas or numpy. However, I am cautious about what you hope to infer based on the standard deviation/coefficient of variation based on 2 data points.

With that warning out of the way, you'll note that the 2 answers below have a similar structure-

  1. Combine datasets into a common container
  2. Perform aggregation computations
  3. Operate on those aggregations to calculate coefficient of variation

NumPy approach

  • stack to combine datasets
  • aggregate with axis=0 allows to aggregate across the lowest dimension
  • calculate coef. of variation
  • reassemble results into a DataFrame
import numpy as np
import pandas as pd

arr = np.stack((df1, df2))
cv_arr = arr.std(axis=0, ddof=1) / arr.mean(axis=0)
out = pd.DataFrame(data=cv_arr, columns=df1.columns, index=df1.index)

print(out)
          A    B         C
0  0.707107  0.0  0.707107
1  0.282843  0.0  0.282843
2  0.176777  0.0  0.176777

Pure pandas approach

  • pd.concat to combine datasets and specify the keys paramter to obtain use a MultiIndex to represent data higher than 2d
  • groupby to operate across the groups
  • .swaplevels to better leverage index alignment
  • use .pipe to calculate coef of variation
import pandas as pd

out = (
    pd.concat([df1, df2], keys=[0, 1])
    .groupby(level=1)
    .agg(['mean', 'std'])
    .swaplevel(axis=1)
    .pipe(lambda df_: df_['std'] / df_['mean'])
)

print(out)
          A    B         C
0  0.707107  0.0  0.707107
1  0.282843  0.0  0.282843
2  0.176777  0.0  0.176777

I'd personally recommend the numpy solution because these data & analysis are better represented by a 3d array than a hierarchical DataFrame.

  • Related