Lambda function to calculate mean of mean

My dataframe has a two-level index, and I can calculate the mean for each value of the primary index using the mean-of-means method: mean2 = df.groupby(level=['index1']).mean().mean(axis=1). I saw another method that uses a lambda function and gives the same values, but I can't understand what is going on inside apply(lambda ...).

Any explanation is very much appreciated.

import numpy as np
import pandas as pd
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
s = pd.Series(np.random.randn(8), index=arrays)
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df.index.names = ['index1', 'index2']
df


# Method 1: mean over the whole data of each group
mean1 = df.groupby(level='index1').apply(lambda cormat: cormat.values.mean())
# Method 2: mean of means
mean2 = df.groupby(level=['index1']).mean().mean(axis=1)

print(mean1,mean2)

CodePudding user response：

From GroupBy.apply documentation:

The function passed to apply must take a dataframe as its first argument

The function you pass to apply() receives each group's whole dataframe as its argument, so calling .values.mean() computes the mean over all rows and columns of that group. In method #1 the dataframe is grouped by the first index level, the mean over all values in each group is computed, and the per-group results are then combined into a Series.
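
To make this concrete, here is a minimal sketch that records what the function actually receives. The small deterministic frame, the helper name whole_group_mean and the seen_shapes dict are made up for illustration and are not part of the question's code.

```python
import numpy as np
import pandas as pd

# Small deterministic frame with the same index layout as in the question.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["index1", "index2"]
)
df = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4), index=index)

seen_shapes = {}  # hypothetical helper: index1 value -> shape of the group passed in

def whole_group_mean(group):
    # `group` is the sub-DataFrame holding all rows of one index1 value.
    key = group.index.get_level_values("index1")[0]
    seen_shapes[key] = group.shape
    # .values is a 2-D NumPy array; .mean() with no axis averages every cell.
    return group.values.mean()

mean1 = df.groupby(level="index1").apply(whole_group_mean)
print(seen_shapes)  # each group is a 2x4 DataFrame
print(mean1)        # one scalar per index1 value
```

Each call sees a 2x4 block (two index2 rows, four columns) and reduces it to a single number, which is why the result is a Series with one entry per index1 value.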

CodePudding user response：

In Python, lambdas are anonymous functions, so using one is no different from defining a separate function, such as:

def first_mean(cormat):
    return cormat.values.mean()


mean1 = df.groupby(level='index1').apply(first_mean)

In the anonymous function: lambda cormat: cormat.values.mean()

cormat: argument name
cormat.values.mean(): return value

In other words, cormat is just a parameter name. You could just as well call it x, although a one-letter name is worse for readability.

The reason to use a lambda is convenience: the author did not bother to define a separate function for the first mean and passed a lambda instead.

From the pandas perspective, .groupby() returns a GroupBy object; iterating over it yields (name, group) tuples, and .apply() calls the function on each group and combines the results. In other words, the whole .apply(lambda ...) part is roughly equivalent to the following:

def first_mean(cormat):
    return cormat.values.mean()


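groups = df.groupby(level='index1')
means = {}  # group name -> mean over all values of that group
for name, group in groups:
    means[name] = first_mean(group)

# first_mean returns a scalar per group, so build a Series (pd.concat needs pandas objects)
result = pd.Series(means).rename_axis('index1')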

Hope this helps.

CodePudding user response：

First, let's consider the following data and see the output:

data = {
    1 : [10,20,15,15],
    2 : [10,12,11,11]
}
df = pd.DataFrame(data)

mean = df.apply(lambda x: x.mean())
print(mean)

Output:

1    15.0
2    11.0

Roughly speaking, apply works like a for loop over the dataframe: it calls the function you pass (often a lambda) once per column by default (axis=0), or once per row with axis=1. That is why apply with a lambda is so common when we want to loop over a dataframe, especially with conditional expressions. Running df.apply(lambda x: x.mean()) therefore passes each column to the lambda as a Series and returns the mean of each column, which is the output above.
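
A short sketch of the axis behaviour, reusing the same small frame; the variable names col_means and row_means are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({1: [10, 20, 15, 15], 2: [10, 12, 11, 11]})

# Default axis=0: the lambda receives each column as a Series.
col_means = df.apply(lambda x: x.mean())

# axis=1: the lambda receives each row as a Series.
row_means = df.apply(lambda x: x.mean(), axis=1)

print(col_means)  # one value per column label (1 and 2)
print(row_means)  # one value per row label (0 to 3)
```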

This is your dataframe:

                      0         1         2         3
index1 index2                                        
bar    one    -0.670105  0.007948  1.016790  0.539176
       two     0.025020  0.751342 -0.402003 -1.279099
baz    one    -0.379699  1.834577 -0.106809 -0.105114
       two     0.341889  0.697291  0.640217 -0.264288
foo    one     0.985640  1.079954 -1.079756 -0.252929
       two     0.559913  1.254874 -0.387722 -0.992791
qux    one    -0.192437 -0.522757 -0.638837  1.826321
       two    -2.106791 -0.280402  0.593201  0.824298

With this code, we have the following output:

# Method 1, but with x.mean() instead of x.values.mean()
mean1 = df.groupby(level='index1').apply(lambda x: x.mean())

# Method 2: mean of means
mean2 = df.groupby(level=['index1']).mean().mean(axis=1)

print(mean1,'\n-----------------------------------------------\n',mean2)

Output:

               0         1         2         3
index1                                        
bar    -0.322543  0.379645  0.307393 -0.369962
baz    -0.018905  1.265934  0.266704 -0.184701
foo     0.772776  1.167414 -0.733739 -0.622860
qux    -1.149614 -0.401579 -0.022818  1.325309 
-----------------------------------------------
 index1
bar   -0.001367
baz    0.332258
foo    0.145898
qux   -0.062175
dtype: float64

So without .values, x.mean() gives the mean of each column within each index1 group, that is, averaged over the index2 rows.

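The code below makes it clear what .values does: it turns each group into a plain 2-D NumPy array.

values = df.groupby(level='index1').apply(lambda x: x.values)
print(values.iloc[0])  # positional access; values[0] relies on a deprecated fallback for string indexes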

Output:

[[-0.6701054   0.00794845  1.01678958  0.5391757 ]
 [ 0.02501959  0.75134187 -0.40200268 -1.27909938]]

Now Method 1 should make more sense, since calling .mean() on that array averages every value in the group:

# Method 1: mean over the whole data of each group
mean1 = df.groupby(level='index1').apply(lambda x: x.values.mean())

# Method 2: mean of means
mean2 = df.groupby(level=['index1']).mean().mean(axis=1)

print(mean1,'\n-----------------------------------------------\n',mean2)

Output:

index1
bar   -0.001367
baz    0.332258
foo    0.145898
qux   -0.062175
dtype: float64 
-----------------------------------------------
 index1
bar   -0.001367
baz    0.332258
foo    0.145898
qux   -0.062175
dtype: float64
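
The two results match here because every column contributes the same number of values to each group and nothing is missing, so the mean of the column means equals the mean of all cells. The sketch below checks this on freshly generated data and then shows one case where the methods differ. The seed and the inserted NaN are assumptions added for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["index1", "index2"]
)
df = pd.DataFrame(rng.standard_normal((8, 4)), index=index)

grouped = df.groupby(level="index1")
mean1 = grouped.apply(lambda x: x.values.mean())  # mean over every cell of the group
mean2 = grouped.mean().mean(axis=1)               # column means, then their mean
same = np.allclose(mean1, mean2)
print(same)

# With a missing value the methods diverge: NumPy's mean propagates NaN,
# while the pandas mean skips it by default.
df_nan = df.copy()
df_nan.iloc[0, 0] = np.nan
grouped_nan = df_nan.groupby(level="index1")
m1 = grouped_nan.apply(lambda x: x.values.mean())
m2 = grouped_nan.mean().mean(axis=1)
print(m1["bar"], m2["bar"])
```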