Home > Enterprise >  How to summarize in R the number of first occurrences of a character string in a dataframe column?
How to summarize in R the number of first occurrences of a character string in a dataframe column?

Time:11-24

I am trying to figure out a fast way to calculate the number of "first times" a specified character appears in a dataframe column, by groups. In this example, I am trying to summarize (sum) the number of first times, for each Period, the State of "X" appears, grouped by ID. I am looking for a fast way to process this because it is to be run against a database of several million rows. Maybe there is a good solution using the data.table package?

Immediately below I illustrate what I am trying to achieve, and at the bottom I post the code for the dataframe called testDF.

enter image description here

Code:

testDF <-
  data.frame(
    ID = c(rep(10,5),rep(50,5),rep(60,5)),
    Period = c(1:5,1:5,1:5),
    State = c("A","B","X","X","X",
              "A","A","A","A","A",
              "A","X","A","X","B")
  )

CodePudding user response:

Maybe we can group by 'ID' first and then create the column and then do a group by 'period' and summarise

library(dplyr)
testDF %>% 
  group_by(ID) %>% 
  mutate(`1stStateX` = row_number() == which(State == "X")[1]) %>% 
  group_by(Period) %>%
  summarise(`1stStateX` = sum(`1stStateX`, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 5 × 2
  Period `1stStateX`
   <int>       <int>
1      1           0
2      2           1
3      3           1
4      4           0
5      5           0

Another option will be to slice after grouping by 'ID', get the count and use complete to fill the 'Period' not available

library(tidyr)
testDF %>% 
  group_by(ID) %>%
  slice(match('X', State)) %>%
  ungroup %>% 
  count(Period, sort = TRUE ,name = "1stStateX") %>% 
  complete(Period = unique(testDF$Period), 
    fill = list(`1stStateX` = 0))

-output

# A tibble: 5 × 2
  Period `1stStateX`
   <int>       <int>
1      1           0
2      2           1
3      3           1
4      4           0
5      5           0

Or similar option in data.table

library(data.table)
setDT(testDF)[, `1stStateX` := .I == .I[State == 'X'][1], 
  ID][, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]

-output

 Period 1stStateX
    <int>     <int>
1:      1         0
2:      2         1
3:      3         1
4:      4         0
5:      5         0
  • Related