Count string by group (R)

Time:10-04

Grouping by year, I would like to count how many times a string appears across multiple variables (columns).

year <- c("1993", "1994", "1995")
var1 <- c("tardigrades are usually about 0.5 mm long when fully grown.", "slow steppers", "easy") 
var2 <- c("something", "polar bear", "tardigrades are prevalent in mosses and lichens and feed on plant cells")
var3 <- c("kleiner wasserbaer", "newly hatched", "happy learning")
tardigrades <- data.frame(year, var1, var2, var3)

library(dplyr)
library(stringr)

count_year <- tardigrades %>%
  group_by(year) %>%
  summarize(count = sum(str_count(tardigrades, 'tardigrades')))

Unfortunately, with this solution every year gets the overall total. What am I doing wrong?

CodePudding user response:

You should (almost) never refer to the original frame (tardigrades) inside a dplyr pipe: doing so bypasses the grouping, so every group sees the whole frame and gets the same total. If you want to operate on most or all columns, use an aggregating or iterating function and be explicit about the columns (e.g., everything() in tidyselect-speak).
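To see why a reference to the original frame ignores the grouping, here is a minimal sketch. The column names n_all and n_grp are made up for illustration. nrow(tardigrades) looks at the whole frame in every group, while n() is evaluated for the current group only:

```r
library(dplyr)

# Same data as in the question
tardigrades <- data.frame(
  year = c("1993", "1994", "1995"),
  var1 = c("tardigrades are usually about 0.5 mm long when fully grown.", "slow steppers", "easy"),
  var2 = c("something", "polar bear", "tardigrades are prevalent in mosses and lichens and feed on plant cells"),
  var3 = c("kleiner wasserbaer", "newly hatched", "happy learning"),
  stringsAsFactors = FALSE
)

demo <- tardigrades %>%
  group_by(year) %>%
  summarize(
    n_all = nrow(tardigrades),  # whole frame: identical in every group
    n_grp = n()                 # rows in the current group only
  )
```

Here n_all is 3 for every year, whereas n_grp is 1, which mirrors the "same total in every year" symptom from the question.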

Two suggestions for how to approach this:

  1. Pivot and summarize:

    library(dplyr)
    library(tidyr) # pivot_longer
    pivot_longer(tardigrades, -year) %>%
      group_by(year) %>%
      summarize(count = sum(grepl("tardigrades", value)))
    # # A tibble: 3 x 2
    #   year  count
    #   <chr> <int>
    # 1 1993      1
    # 2 1994      0
    # 3 1995      1
    
  2. Sum across the columns; this must be done rowwise (and not by year):

    tardigrades %>%
      rowwise() %>%
      summarize(
        year,
        count = sum(grepl("tardigrades", c_across(-year))),
        .groups = "drop")
    # # A tibble: 3 x 2
    #   year  count
    #   <chr> <int>
    # 1 1993      1
    # 2 1994      0
    # 3 1995      1
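A possible third variant, offered as a sketch rather than as part of the answer above, keeps the asker's group_by(year) and str_count() but names the columns explicitly with across(everything()). Inside a grouped summarize(), across() leaves out the grouping column, so only var1 to var3 are searched. Note that grepl() counts cells that contain the string, while str_count() counts occurrences, so the two can differ when a cell mentions the string more than once.

```r
library(dplyr)
library(stringr)

# Same data as in the question
tardigrades <- data.frame(
  year = c("1993", "1994", "1995"),
  var1 = c("tardigrades are usually about 0.5 mm long when fully grown.", "slow steppers", "easy"),
  var2 = c("something", "polar bear", "tardigrades are prevalent in mosses and lichens and feed on plant cells"),
  var3 = c("kleiner wasserbaer", "newly hatched", "happy learning"),
  stringsAsFactors = FALSE
)

count_year <- tardigrades %>%
  group_by(year) %>%
  summarize(
    # across() returns one count column per variable for the current year;
    # sum() adds them up into a single number per group
    count = sum(across(everything(), ~ str_count(.x, "tardigrades")))
  )
```

With the sample data this gives 1, 0 and 1 for 1993, 1994 and 1995, and it should also work when a year spans several rows.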
    