Grouped by year, I would like to count how many times a string appears across multiple variables (columns).
year <- c("1993", "1994", "1995")
var1 <- c("tardigrades are usually about 0.5 mm long when fully grown.", "slow steppers", "easy")
var2 <- c("something", "polar bear", "tardigrades are prevalent in mosses and lichens and feed on plant cells")
var3 <- c("kleiner wasserbaer", "newly hatched", "happy learning")
tardigrades <- data.frame(year, var1, var2, var3)
library(dplyr)
library(stringr) # str_count
count_year <- tardigrades %>%
  group_by(year) %>%
  summarize(count = sum(str_count(tardigrades, 'tardigrades')))
Unfortunately, with this solution every year gets the overall total instead of its own count. What am I doing wrong?
CodePudding user response:
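You should (almost) never use the original frame (tardigrades) in a dplyr pipe. Inside summarize, the name tardigrades refers to the entire data frame regardless of the grouping, which is why every year receives the same overall total. If you want to operate on most or all columns, then you need to use an aggregating or iterating function and be explicit about the columns (e.g., everything() in tidyselect-speak).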
Two suggestions for how to approach this:
Pivot and summarize:
library(dplyr)
library(tidyr) # pivot_longer
pivot_longer(tardigrades, -year) %>%
  group_by(year) %>%
  summarize(count = sum(grepl("tardigrades", value)))
# # A tibble: 3 x 2
#   year  count
#   <chr> <int>
# 1 1993      1
# 2 1994      0
# 3 1995      1
Sum across the columns; this must be done rowwise (and not by year):
tardigrades %>%
  rowwise() %>%
  summarize(
    year,
    count = sum(grepl("tardigrades", c_across(-year))),
    .groups = "drop"
  )
# # A tibble: 3 x 2
#   year  count
#   <chr> <int>
# 1 1993      1
# 2 1994      0
# 3 1995      1
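One caveat: grepl only reports whether a cell contains the string, so a cell that mentions it twice still counts once. If the goal is the number of occurrences, a possible variant of the pivot approach (a sketch, not part of the original answer) is to swap grepl for stringr's str_count, applied to the pivoted value column rather than to the whole frame. The name count_year is taken from the question.

```r
library(dplyr)
library(tidyr)    # pivot_longer
library(stringr)  # str_count

# Same sample data as in the question
tardigrades <- data.frame(
  year = c("1993", "1994", "1995"),
  var1 = c("tardigrades are usually about 0.5 mm long when fully grown.",
           "slow steppers", "easy"),
  var2 = c("something", "polar bear",
           "tardigrades are prevalent in mosses and lichens and feed on plant cells"),
  var3 = c("kleiner wasserbaer", "newly hatched", "happy learning")
)

count_year <- tardigrades %>%
  pivot_longer(-year) %>%   # one row per (year, column) cell, text in `value`
  group_by(year) %>%
  # str_count counts every occurrence within each cell, not just presence
  summarize(count = sum(str_count(value, "tardigrades")))

count_year
```

On the sample data this gives the same counts as the grepl versions (1, 0, 1), because no cell contains the word more than once.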