Home > Mobile >  How to filter a dataframe by the mean of each group using a on-liner pandas code
How to filter a dataframe by the mean of each group using a on-liner pandas code

Time:01-12

I'm trying to filter my dataset so that only the rows that, for a given column, have values larger than the mean (or any other function) of that column.

For instance, suppose we have the following data frame:

import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "A", "B", "B", "C"], 
    "C1": [1, 2, 3, 2, 3, 1], 
    "C2": [1, 1, 5, 1, 2, 1]
})
  Group  C1  C2
0     A   1   1
1     A   2   1
2     A   3   5
3     B   2   1
4     B   3   2
5     C   1   1

Now, I want to create other filtered data frames subsetting the original one based on some function. For example, let's use the mean as a baseline:

df.groupby("Group").mean()
        C1        C2
Group
A      2.0  2.333333
B      2.5  1.500000
C      1.0  1.000000

Now, I want all points such values greater than or equal to the mean in column C1:

  Group  C1  C2
1     A   2   1
2     A   3   5
4     B   3   2
5     C   1   1

Or I want a subset such that the values are less than or equal to the mean in column C2:

  Group  C1  C2
0     A   1   1
1     A   2   1
3     B   2   1
5     C   1   1

To make this easier/more compact, it would be good to have a pandas on-liner fashion code, i.e., using the typical . pipeline, using something like:

df.groupby("Group").filter(lambda x : x["C1"] >= x["C1"].mean())

Note that the code above doesn't work because filter requires a function that returns returns True/False but not a data frame to be combined, as I intend to.

Obviously, I can iterate using groupby, filter the group, and then concatenate the results:

new_df = None
for _, group in df.groupby("Group"):
    tmp = group[group["C1"] >= group["C1"].mean()]
    new_df = pd.concat([new_df, tmp])

(Note: The >=, otherwise we will have empty data frames messing up with the concatenation)

Same thing I can do in the other case:

new_df = None
for _, group in df.groupby("Group"):
    tmp = group[group["C2"] <= group["C2"].mean()]
    new_df = pd.concat([new_df, tmp])

But do we have a pandas-idiomatic (maybe generic, short, and probably optimized) way to do that?

Just for curiosity, I can do it very easily in R:

r$> df <- tibble(
        Group = c("A", "A", "A", "B", "B", "C"),
        C1 = c(1, 2, 3, 2, 3, 1),
        C2 = c(1, 1, 5, 1, 2, 1)
)

r$> df
# A tibble: 6 × 3
  Group    C1    C2
  <chr> <dbl> <dbl>
1 A         1     1
2 A         2     1
3 A         3     5
4 B         2     1
5 B         3     2
6 C         1     1

r$> df %>% group_by(Group) %>% filter(C1 >= mean(C1))
# A tibble: 4 × 3
# Groups:   Group [3]
  Group    C1    C2
  <chr> <dbl> <dbl>
1 A         2     1
2 A         3     5
3 B         3     2
4 C         1     1

r$> df %>% group_by(Group) %>% filter(C1 <= mean(C1))
# A tibble: 4 × 3
# Groups:   Group [3]
  Group    C1    C2
  <chr> <dbl> <dbl>
1 A         1     1
2 A         2     1
3 B         2     1
4 C         1     1

Thanks!

CodePudding user response:

IIUC, you can use groupby and transform to create a boolean series for boolean indexing your dataframe where column, C1 is greater than or equal to the mean of C1 by group:

df[df['C1']>=df.groupby("Group")['C1'].transform('mean')]

Output:

  Group  C1  C2
1     A   2   1
2     A   3   5
4     B   3   2
5     C   1   1
  • Related