How can I sum multiple rows of a dataset into one in R? based on column value-CodePudding

I have a dataset that lists botanical families (col 1) and the number of plant species within each of them (col 2). I want to create an histogram to show in a visually nice way which are the most species rich families in my dataset. However, since the dataset lists several hundreds of families, I would like to condense into a single row (named "other") all the families that contain less than 40 species.

I have something like this:

Family	Species_n
Myrtaceae	234
Fabaceae	156
Rosaceae	111
Moraceae	30
Rubiaceae	24
Poaceae	23

And I would need to obtain something like this

Family	Species_n
Myrtaceae	234
Fabaceae	156
Rosaceae	111
Others	77

Does anyone know how to do this? I've tried a few functions like "Group_by" and "group_data" but they don't seem to do what I need.

Thanks!!

CodePudding user response：

You can change the value of Family if Species_n is less than 40 and aggregate.

aggregate(Species_n~Family, 
          transform(df, Family = ifelse(Species_n <= 40, 'Other', Family)), sum)

#     Family Species_n
#1  Fabaceae       156
#2 Myrtaceae       234
#3     Other        77
#4  Rosaceae       111

forcats has a function to do this fct_lump_min

library(dplyr)
library(forcats)

df %>%
  group_by(Family = fct_lump_min(Family, 40, Species_n)) %>%
  summarise(Species_n = sum(Species_n))

data

df <- structure(list(Family = c("Myrtaceae", "Fabaceae", "Rosaceae", 
"Moraceae", "Rubiaceae", "Poaceae"), Species_n = c(234L, 156L, 
111L, 30L, 24L, 23L)), row.names = c(NA, -6L), class = "data.frame")

CodePudding user response：

We can use case_when

library(dplyr)
df %>%
    group_by(Family = case_when(Species_n <= 40 ~ 'Other', 
        TRUE ~ Family)) %>%
   summarise(Species_n = sum(Species_n))
# A tibble: 4 × 2
  Family    Species_n
  <chr>         <int>
1 Fabaceae        156
2 Myrtaceae       234
3 Other            77
4 Rosaceae        111

data

df <- structure(list(Family = c("Myrtaceae", "Fabaceae", "Rosaceae", 
"Moraceae", "Rubiaceae", "Poaceae"), Species_n = c(234L, 156L, 
111L, 30L, 24L, 23L)), row.names = c(NA, -6L), class = "data.frame")