Home > Software design >  How to consider `NA` when counting using `group_by()` and two factor variables in R
How to consider `NA` when counting using `group_by()` and two factor variables in R

Time:01-10

I have a dataframe df1 in which I have different sites (df1$Site) belonging to different regions (df1$Regions) in which I have data about evidence of herbivory and its type (df1$Herbivory_type). When there is no herbivory then, df1$Herbivory_type is NA. Below I show an example of my dataframe:

df1 <- data.frame(Region=c("ALI1","ALI1","ALI1","ALI1","ALI2","ALI2","ALI2","ALI3","ALI3","ALI3","ALI3","ALI5","ALI5"),
                  Site=c("ALI1_A","ALI1_B","ALI1_C","ALI1_D","ALI2_A","ALI2_B","ALI2_C","ALI3_A","ALI3_B","ALI3_C","ALI3_D","ALI5_A","ALI5_B"),
                  Herbivory_type=c(NA,"S",NA,NA,NA,NA,NA,NA,"S","S",NA,NA,"S"))

df1$Herbivory_type <- as.factor(df1$Herbivory_type)

df1

   Region   Site Herbivory_type
1    ALI1 ALI1_A           <NA>
2    ALI1 ALI1_B              S
3    ALI1 ALI1_C           <NA>
4    ALI1 ALI1_D           <NA>
5    ALI2 ALI2_A           <NA>
6    ALI2 ALI2_B           <NA>
7    ALI2 ALI2_C           <NA>
8    ALI3 ALI3_A           <NA>
9    ALI3 ALI3_B              S
10   ALI3 ALI3_C              S
11   ALI3 ALI3_D           <NA>
12   ALI5 ALI5_A           <NA>
13   ALI5 ALI5_B              S

I need to know the number of episodes of herbivorism by Region considering NA in the counting in df1$Site. I would expect this result:

df2

   Region N_Hervivory_S
1   ALI1             1
2   ALI2             0   # All sites have `NA`, thus, herbivorims is 0 in this region.
3   ALI3             2
4   ALI5             1

I tried this:

as.data.frame(df1 %>% group_by(Region,Herbivory_type) %>% summarise(N = n()))

But the output is not what I desire

  Region Herbivory_type N
1   ALI1              S 1
2   ALI1           <NA> 3
3   ALI2           <NA> 3
4   ALI3              S 2
5   ALI3           <NA> 2
6   ALI5              S 1
7   ALI5           <NA> 1

Does anyone know how to do it?

Thanks in advance

CodePudding user response:

You can use count() to sum up !is.na(Herbivory_type) by group, and get the number of non-missing values for each region.

library(dplyr)

df1 %>%
  count(Region, wt = !is.na(Herbivory_type))

# # A tibble: 4 × 2
#   Region   res
#   <chr>  <int>
# 1 ALI1       1
# 2 ALI2       0
# 3 ALI3       2
# 4 ALI5       1

CodePudding user response:

library(dplyr)
df1 %>% 
    group_by(Region) %>%
    summarise(n_Herbivory_S = sum(Herbivory_type %in% c("S")))

(assuming there may be other categories to ignore in the real dataset - otherwise !is.na() is simpler)

CodePudding user response:

You can count non-NAs, i.e.

library(dplyr)

df1 %>% 
 group_by(Region) %>% 
 summarise(res = sum(!is.na(Herbivory_type)))

# A tibble: 4 × 2
  Region   res
  <chr>  <int>
1 ALI1       1
2 ALI2       0
3 ALI3       2
4 ALI5       1
  • Related