Here is my dilemma:
I have a dictionary of dataframes like this:
dict_df[key]
   m1  m2  m3  m4  m5  m6
10410   5  10  21  33  11
15387   3  10  33  45  13
19026   4  16  27  40  11
26083   5  21  16  29   9
27806   4  17  23  31   7
43820   2  12  27  40  18
49199   7  22  30  38  11
50094   4   9  13  18   4
Each key returns a DataFrame with the same column names.
For each key, I need to store the mean and standard deviation of a set of features (let's take for example m2, m3, m4).
In the end, I want to obtain something like this df below (the numbers are totally random):
key m2_mean m2_std m3_mean m3_std m4_mean m4_std
key1 12 55 793 438 44 95
key2 14 442 21 43 14 442
key3 44 1 66 11 42 42
key4 42 42 2 23 98 70
The dataset is not that big, so even if the code is slow it should be fine.
Thanks for the help and have a good one!
CodePudding user response:
First, let's define some sample data:
>>> df1 = pd.DataFrame({
"col1": [1, 2, 3],
"col2": [4, 5, 6],
})
>>> df2 = pd.DataFrame({
"col1": [7, 8, 9],
"col2": [10, 11, 12],
})
>>> dict_df = {
"df1": df1,
"df2": df2,
}
Now, you can use .agg() to calculate the mean and std of your dataframe (I've used max instead of std for simplicity), .stack() to reduce the dataframe into a single series, and .to_dict() to turn that series into a plain dictionary keyed by (aggregation, column) tuples. Notice that we'll only use one of the dataframes (df1) to show this result:
>>> df1.agg(["mean", "max"]).stack().to_dict()
{('mean', 'col1'): 2.0, ('mean', 'col2'): 5.0, ('max', 'col1'): 3.0, ('max', 'col2'): 6.0}
With this dict representation, we can use pd.DataFrame.from_dict to build a single dataframe with the metrics for each value in dict_df:
>>> df = pd.DataFrame.from_dict({
df_name: df[["col1", "col2"]].agg(["mean", "max"]).stack().to_dict()
for df_name, df in dict_df.items()
}, orient="index")
>>> df
mean max
col1 col2 col1 col2
df1 2.0 5.0 3.0 6.0
df2 8.0 11.0 9.0 12.0
The only important difference from your expected output is the column names: the MultiIndex puts the aggregation first, while you want the column name first. We can solve that by joining each tuple in reversed order:
>>> df.columns = ["_".join(reversed(column)) for column in df.columns]
>>> df
     col1_mean  col2_mean  col1_max  col2_max
df1        2.0        5.0       3.0       6.0
df2        8.0       11.0      9.0       12.0
Code that would do the trick for you:
>>> target_columns = ["m2", "m3", "m4"]
>>> df = pd.DataFrame.from_dict({
...     key: df[target_columns].agg(["mean", "std"]).stack().to_dict()
...     for key, df in dict_df.items()
... }, orient="index")
>>> df.columns = ["_".join(reversed(column)) for column in df.columns]
>>> df.index.name = "key"
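Putting the steps above together as a minimal end-to-end sketch; since the original dataframes aren't available, the sample data below is made up:

```python
import pandas as pd

# Made-up stand-ins for the original dataframes (values are invented).
dict_df = {
    "key1": pd.DataFrame({"m2": [5, 3, 4], "m3": [10, 10, 16], "m4": [21, 33, 27]}),
    "key2": pd.DataFrame({"m2": [4, 7, 4], "m3": [17, 22, 9], "m4": [23, 30, 13]}),
}

target_columns = ["m2", "m3", "m4"]

# One row per key: aggregate, stack into a series, then convert to a dict
# keyed by (aggregation, column) tuples.
result = pd.DataFrame.from_dict({
    key: df[target_columns].agg(["mean", "std"]).stack().to_dict()
    for key, df in dict_df.items()
}, orient="index")

# Flatten the MultiIndex columns into "m2_mean"-style names.
result.columns = ["_".join(reversed(column)) for column in result.columns]
result.index.name = "key"
print(result)
```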
CodePudding user response:
Start with creating a dictionary for the aggregates, in the form new_col_name: (dataframe_column_name, aggregate_function):
required_aggs = {f'{col_name}_{agg_name}': (col_name, agg_name)
for agg_name in ['mean', 'std']
for col_name in df}
The above gives the following dictionary of the required aggregates, keyed by the new column names:
{
'm1_mean': ('m1', 'mean'),
'm1_std': ('m1', 'std'),
'm2_mean': ('m2', 'mean'),
'm2_std': ('m2', 'std'),
'm3_mean': ('m3', 'mean'),
'm3_std': ('m3', 'std'),
'm4_mean': ('m4', 'mean'),
'm4_std': ('m4', 'std'),
'm5_mean': ('m5', 'mean'),
'm5_std': ('m5', 'std'),
'm6_mean': ('m6', 'mean'),
'm6_std': ('m6', 'std')
}
Now, for each dataframe, assign a key column holding the value of the key, group by this column, and call .agg on the groupby object, passing the unpacked version of the above aggregate dictionary:
>>> df.assign(key=123).groupby('key').agg(**required_aggs)
m1_mean m2_mean m3_mean ... m4_std m5_std m6_std
key ...
123 30228.125 4.25 14.625 ... 6.860862 8.447316 4.140393
PS: In the above snippet, df is a single dataframe from dict_df, and the value of its key is assumed to be 123. You need to do the same for each dataframe in the dictionary. Also, if you only need aggregates of specific columns, replace df in for col_name in df with a list of columns, e.g. for col_name in ['m1', 'm2'].
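Looping over the whole dictionary and concatenating the per-key results could look like this sketch (the sample data is invented, and only m2, m3, m4 are aggregated):

```python
import pandas as pd

# Made-up stand-ins for the original dataframes (values are invented).
dict_df = {
    "key1": pd.DataFrame({"m2": [5, 3, 4], "m3": [10, 10, 16], "m4": [21, 33, 27]}),
    "key2": pd.DataFrame({"m2": [4, 7, 4], "m3": [17, 22, 9], "m4": [23, 30, 13]}),
}

target_columns = ["m2", "m3", "m4"]

# new_col_name -> (dataframe_column_name, aggregate_function)
required_aggs = {f"{col_name}_{agg_name}": (col_name, agg_name)
                 for agg_name in ["mean", "std"]
                 for col_name in target_columns}

# One aggregated row per key, stacked into a single dataframe.
result = pd.concat(
    df.assign(key=key).groupby("key").agg(**required_aggs)
    for key, df in dict_df.items()
)
print(result)
```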