I'm trying to filter my dataset so that only the rows that, for a given column, have values larger than the mean (or any other function) of that column.
For instance, suppose we have the following data frame:
import pandas as pd
df = pd.DataFrame({
"Group": ["A", "A", "A", "B", "B", "C"],
"C1": [1, 2, 3, 2, 3, 1],
"C2": [1, 1, 5, 1, 2, 1]
})
Group C1 C2
0 A 1 1
1 A 2 1
2 A 3 5
3 B 2 1
4 B 3 2
5 C 1 1
Now, I want to create other filtered data frames subsetting the original one based on some function. For example, let's use the mean as a baseline:
df.groupby("Group").mean()
C1 C2
Group
A 2.0 2.333333
B 2.5 1.500000
C 1.0 1.000000
Now, I want all points such values greater than or equal to the mean in column C1
:
Group C1 C2
1 A 2 1
2 A 3 5
4 B 3 2
5 C 1 1
Or I want a subset such that the values are less than or equal to the mean in column C2
:
Group C1 C2
0 A 1 1
1 A 2 1
3 B 2 1
5 C 1 1
To make this easier/more compact, it would be good to have a pandas on-liner fashion code, i.e., using the typical .
pipeline, using something like:
df.groupby("Group").filter(lambda x : x["C1"] >= x["C1"].mean())
Note that the code above doesn't work because filter
requires a function that returns returns True/False
but not a data frame to be combined, as I intend to.
Obviously, I can iterate using groupby
, filter the group, and then concatenate the results:
new_df = None
for _, group in df.groupby("Group"):
tmp = group[group["C1"] >= group["C1"].mean()]
new_df = pd.concat([new_df, tmp])
(Note: The >=
, otherwise we will have empty data frames messing up with the concatenation)
Same thing I can do in the other case:
new_df = None
for _, group in df.groupby("Group"):
tmp = group[group["C2"] <= group["C2"].mean()]
new_df = pd.concat([new_df, tmp])
But do we have a pandas-idiomatic (maybe generic, short, and probably optimized) way to do that?
Just for curiosity, I can do it very easily in R:
r$> df <- tibble(
Group = c("A", "A", "A", "B", "B", "C"),
C1 = c(1, 2, 3, 2, 3, 1),
C2 = c(1, 1, 5, 1, 2, 1)
)
r$> df
# A tibble: 6 × 3
Group C1 C2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 5
4 B 2 1
5 B 3 2
6 C 1 1
r$> df %>% group_by(Group) %>% filter(C1 >= mean(C1))
# A tibble: 4 × 3
# Groups: Group [3]
Group C1 C2
<chr> <dbl> <dbl>
1 A 2 1
2 A 3 5
3 B 3 2
4 C 1 1
r$> df %>% group_by(Group) %>% filter(C1 <= mean(C1))
# A tibble: 4 × 3
# Groups: Group [3]
Group C1 C2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 B 2 1
4 C 1 1
Thanks!
CodePudding user response:
IIUC, you can use groupby
and transform
to create a boolean series for boolean indexing your dataframe where column, C1 is greater than or equal to the mean of C1 by group:
df[df['C1']>=df.groupby("Group")['C1'].transform('mean')]
Output:
Group C1 C2
1 A 2 1
2 A 3 5
4 B 3 2
5 C 1 1