I have two pandas dataframes that are the same length and width. I want to find the %CV between each cell of the two dataframes (or the standard dev and mean, and then I can calculate CV by hand).
For example, if I have something like this:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
and
A B C
0 3 2 1
1 6 5 4
2 9 8 7
I want to return one dataframe containing the standard deviation/mean/CV between A1 in the first dataframe and A1 in the second dataframe.
Example (mean):
A B C
0 2 2 2
1 5 5 5
2 8 8 8
I want the same for standard deviation and then to calculate %CV via standard deviation / mean.
I have tried converting the dataframes to numpy:
stan_dev = np.dstack((arr1.to_numpy(), arr2.to_numpy())).std(axis=2)
but I get this error:
TypeError: loop of ufunc does not support argument 0 of type float which has no callable sqrt method
Thank you!
CodePudding user response:
You can do this with either pandas or numpy. However, I am cautious about what you hope to infer from a standard deviation/coefficient of variation based on only 2 data points.
With that warning out of the way, you'll note that the 2 answers below have a similar structure:
- Combine datasets into a common container
- Perform aggregation computations
- Operate on those aggregations to calculate the coefficient of variation
NumPy approach
- np.stack to combine the datasets into a single 3d array
- aggregate with axis=0 to reduce across the new leading dimension
- calculate the coefficient of variation
- reassemble the results into a DataFrame
import numpy as np
import pandas as pd

# example data from the question
df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
df2 = pd.DataFrame({"A": [3, 6, 9], "B": [2, 5, 8], "C": [1, 4, 7]})

arr = np.stack((df1, df2))                           # shape (2, 3, 3)
cv_arr = arr.std(axis=0, ddof=1) / arr.mean(axis=0)  # sample std / mean, per cell
out = pd.DataFrame(data=cv_arr, columns=df1.columns, index=df1.index)
print(out)
A B C
0 0.707107 0.0 0.707107
1 0.282843 0.0 0.282843
2 0.176777 0.0 0.176777
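Incidentally, the TypeError in the question usually indicates that the stacked array ended up with object dtype, in which case NumPy tries to call a .sqrt method on each element. Casting to float before stacking avoids it — a minimal sketch, assuming the question's data:

```python
import numpy as np
import pandas as pd

# dtype=object mimics the situation that typically triggers the error
df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]}, dtype=object)
df2 = pd.DataFrame({"A": [3, 6, 9], "B": [2, 5, 8], "C": [1, 4, 7]}, dtype=object)

# cast to float so std() can take a numeric square root
arr = np.dstack((df1.to_numpy(dtype=float), df2.to_numpy(dtype=float)))
stan_dev = arr.std(axis=2, ddof=1)  # per-cell sample standard deviation
cv = stan_dev / arr.mean(axis=2)    # coefficient of variation per cell
print(cv)
```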
Pure pandas approach
- pd.concat to combine the datasets, specifying the keys parameter to obtain a MultiIndex that can represent data higher than 2d
- groupby to operate across the groups
- .swaplevel to better leverage index alignment
- use .pipe to calculate the coefficient of variation
import pandas as pd

# df1, df2 as defined above
out = (
    pd.concat([df1, df2], keys=[0, 1])  # MultiIndex rows: (source key, original row)
    .groupby(level=1)                   # group matching rows across the two frames
    .agg(['mean', 'std'])               # columns become (column, statistic)
    .swaplevel(axis=1)                  # columns become (statistic, column)
    .pipe(lambda df_: df_['std'] / df_['mean'])
)
print(out)
print(out)
A B C
0 0.707107 0.0 0.707107
1 0.282843 0.0 0.282843
2 0.176777 0.0 0.176777
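If you also want the per-cell mean and standard deviation themselves (as mentioned in the question), the intermediate aggregation above already holds both; one way to pull them out, assuming df1/df2 from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
df2 = pd.DataFrame({"A": [3, 6, 9], "B": [2, 5, 8], "C": [1, 4, 7]})

agg = (
    pd.concat([df1, df2], keys=[0, 1])
    .groupby(level=1)
    .agg(['mean', 'std'])
    .swaplevel(axis=1)   # columns: (statistic, original column)
)
means = agg['mean']   # per-cell means (2/5/8 for this data)
stds = agg['std']     # per-cell sample standard deviations
cv = stds / means
print(means, stds, cv, sep="\n")
```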
I'd personally recommend the numpy solution, because these data and this analysis are better represented by a 3d array than by a hierarchical DataFrame.