More efficient way to compute mean for subset

In this dataframe:

df <- data.frame(
  comp = c("pre",rep("story",4), rep("x",2), rep("story",3)),
  hbr = c(101:110)
)

Say I need the mean of hbr over the first stretch of rows where comp == "story". How can I do that more efficiently than the approach below? It seems bulky and long-winded, and it requires me to manually specify the grp value I want the mean for:

library(dplyr)
library(data.table)  # provides rleid()
df %>%
  mutate(grp = rleid(comp)) %>%
  summarise(M = mean(hbr[grp == 2]))
#>       M
#> 1 103.5

CodePudding user response：

In base R, you can find the rows of interest with which, diff, and cumsum: take the positions where comp == "story", start a new run wherever consecutive positions are not adjacent, and keep the run you need (here the first, so 1). Then compute the mean over those rows. This way you only state which run of "story" you want rather than looking up its group number by hand, and no additional packages are required.

idx <- which(df$comp == "story")                 # row positions of every "story"
first <- idx[cumsum(c(1, diff(idx) != 1)) == 1]  # positions in the first consecutive run
first
#[1] 2 3 4 5

mean(df$hbr[first])
#[1] 103.5
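
To make the cumsum/diff step easier to follow, here is a small sketch that prints the intermediate vectors for the example data. The names gaps and run_id are illustrative only and do not appear in the answer above; the tapply call at the end is an extra, showing how the same run ids could give the mean of every "story" run at once.

```r
df <- data.frame(
  comp = c("pre", rep("story", 4), rep("x", 2), rep("story", 3)),
  hbr = c(101:110)
)

idx <- which(df$comp == "story")   # positions of all "story" rows
idx
#[1]  2  3  4  5  8  9 10

gaps <- diff(idx) != 1             # TRUE where a new run of "story" starts
gaps
#[1] FALSE FALSE FALSE  TRUE FALSE FALSE

run_id <- cumsum(c(1, gaps))       # running count of runs seen so far
run_id
#[1] 1 1 1 1 2 2 2

# Mean of hbr for each run of "story" (names are the run ids)
run_means <- tapply(df$hbr[idx], run_id, mean)
run_means
#    1     2 
#103.5 109.0 
```

Selecting run_id == 1 reproduces the result above; run_id == 2 would pick the second stretch instead.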

CodePudding user response：

I'm not sure if this is any better, but at least you only need to specify that you want the first run of 'story':

# uses dplyr and data.table (for rleid), as loaded in the question
df %>%
  mutate(grp = ifelse(comp == 'story', rleid(comp), NA)) %>%
  filter(grp == min(grp, na.rm = TRUE)) %>%
  summarise(M = mean(hbr))
#>       M
#> 1 103.5
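
If this comes up repeatedly, the same idea could be wrapped in a small base R helper built on rle(). This is only a sketch: mean_of_run and its arguments (values, labels, target, n) are hypothetical names, not part of any package, and it assumes you want NA back when the requested run does not exist.

```r
# Hypothetical helper: mean of `values` over the n-th consecutive run
# in which `labels` equals `target`.
mean_of_run <- function(values, labels, target, n = 1) {
  runs   <- rle(as.character(labels))     # run lengths and their labels
  ends   <- cumsum(runs$lengths)          # last row of each run
  starts <- ends - runs$lengths + 1       # first row of each run
  hits   <- which(runs$values == target)  # runs that carry the target label
  if (length(hits) < n) return(NA_real_)  # assumption: NA if no such run
  k <- hits[n]
  mean(values[starts[k]:ends[k]])
}

df <- data.frame(
  comp = c("pre", rep("story", 4), rep("x", 2), rep("story", 3)),
  hbr = c(101:110)
)

mean_of_run(df$hbr, df$comp, "story")         # first run of "story"
#[1] 103.5
mean_of_run(df$hbr, df$comp, "story", n = 2)  # second run of "story"
#[1] 109
```

Like the base R answer, this needs no extra packages, and the run is chosen by label and order rather than by a hard-coded group number.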