I have a dataset that lists botanical families (col 1) and the number of plant species within each of them (col 2). I want to create an histogram to show in a visually nice way which are the most species rich families in my dataset. However, since the dataset lists several hundreds of families, I would like to condense into a single row (named "other") all the families that contain less than 40 species.
I have something like this:
Family | Species_n |
---|---|
Myrtaceae | 234 |
Fabaceae | 156 |
Rosaceae | 111 |
Moraceae | 30 |
Rubiaceae | 24 |
Poaceae | 23 |
And I would need to obtain something like this
Family | Species_n |
---|---|
Myrtaceae | 234 |
Fabaceae | 156 |
Rosaceae | 111 |
Others | 77 |
Does anyone know how to do this? I've tried a few functions like "Group_by" and "group_data" but they don't seem to do what I need.
Thanks!!
CodePudding user response:
You can change the value of Family
if Species_n
is less than 40 and aggregate
.
aggregate(Species_n~Family,
transform(df, Family = ifelse(Species_n <= 40, 'Other', Family)), sum)
# Family Species_n
#1 Fabaceae 156
#2 Myrtaceae 234
#3 Other 77
#4 Rosaceae 111
forcats
has a function to do this fct_lump_min
library(dplyr)
library(forcats)
df %>%
group_by(Family = fct_lump_min(Family, 40, Species_n)) %>%
summarise(Species_n = sum(Species_n))
data
df <- structure(list(Family = c("Myrtaceae", "Fabaceae", "Rosaceae",
"Moraceae", "Rubiaceae", "Poaceae"), Species_n = c(234L, 156L,
111L, 30L, 24L, 23L)), row.names = c(NA, -6L), class = "data.frame")
CodePudding user response:
We can use case_when
library(dplyr)
df %>%
group_by(Family = case_when(Species_n <= 40 ~ 'Other',
TRUE ~ Family)) %>%
summarise(Species_n = sum(Species_n))
# A tibble: 4 × 2
Family Species_n
<chr> <int>
1 Fabaceae 156
2 Myrtaceae 234
3 Other 77
4 Rosaceae 111
data
df <- structure(list(Family = c("Myrtaceae", "Fabaceae", "Rosaceae",
"Moraceae", "Rubiaceae", "Poaceae"), Species_n = c(234L, 156L,
111L, 30L, 24L, 23L)), row.names = c(NA, -6L), class = "data.frame")