Manipulating data within a subset is using data outside the subset (??)

I can't work out why my code, which clearly defines a subset of the data, is using a row outside that subset in a calculation. Here is an example. My data is all from the same batch ("batch1"), but I want to calculate NPOC using only the rows up to and including the first "B" it encounters. NPOC is then calculated by subtracting the "B" row's resultC10 value from each row's resultC10 value: F row - B row, then X row - B row. All the data is from the same batch, but since I'm defining a subset, why would the calculation even know about the data in the rest of the batch?

library(dplyr)

dat <- data.frame(sample_ind=c("F","X","B","F","X","B"),
                  resultC=c(7.31,3.12,.79,7.38,2.28,.59),
                  batch=c('batch1','batch1','batch1','batch1','batch1','batch1'))
dat$resultC10=dat$resultC*10
dat$NPOC <- NA
         
start_row = 1

for (i in nrow(dat)) {
  if (dat[i,1]=='B') {
    dat[start_row:i,] <- dat[start_row:i,] %>%
      group_by(batch) %>%
      mutate(NPOC = resultC10-resultC10[sample_ind=='B']) %>%
      ungroup
    start_row = i + 1
  }
}

Here is the result I'm getting:

  sample_ind resultC  batch resultC10 NPOC
1          F    7.31 batch1      73.1 65.2  **NPOC OK-using row 3 (73.1-7.9)
2          X    3.12 batch1      31.2 25.3  **NPOC should be 23.3; it's using row 6 (31.2-5.9)
3          B    0.79 batch1       7.9  0.0  
4          F    7.38 batch1      73.8 67.9  **OK-using row 6
5          X    2.28 batch1      22.8 14.9  **should be 16.9; it's using row 3
6          B    0.59 batch1       5.9  0.0

Any help is greatly appreciated.

CodePudding user response:

You can achieve this without a for loop by grouping on a running count of the "B" rows:

library(dplyr)

dat <- data.frame(sample_ind=c("F","X","B","F","X","B"),
                  resultC=c(7.31,3.12,.79,7.38,2.28,.59),
                  batch=c('batch1','batch1','batch1','batch1','batch1','batch1'))
dat$resultC10=dat$resultC*10

dat %>%
  # Running count of "B" rows, lagged by one row so each "B" closes its own group
  group_by(grp = lag(cumsum(sample_ind == "B"), default = 0)) %>%
  # The last row of each group is that group's "B" row
  mutate(NPOC = resultC10-last(resultC10)) %>%
  ungroup() %>%
  select(-grp)
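
As for why the loop in the question picked up rows outside the intended subset, a likely cause is that `for (i in nrow(dat))` iterates over the single value 6 rather than over 1 to 6. The subset `start_row:i` is then all six rows, both "B" values are selected, and R recycles them down the column. The base R sketch below reproduces the numbers from the question under that reading and then rebuilds the grouping key step by step. The names `visited`, `b_vals`, `recycled`, `key` and `npoc` are illustrative only and do not appear in the original code.

```r
# Same example data as in the question.
dat <- data.frame(sample_ind = c("F", "X", "B", "F", "X", "B"),
                  resultC = c(7.31, 3.12, .79, 7.38, 2.28, .59),
                  batch = rep("batch1", 6))
dat$resultC10 <- dat$resultC * 10

# 1) nrow(dat) is the single number 6, so the loop body runs once, with i = 6.
visited <- c()
for (i in nrow(dat)) visited <- c(visited, i)
visited

# 2) With start_row = 1 and i = 6 the "subset" is every row, so two "B"
#    values are selected and recycled down the column: 7.9, 5.9, 7.9, 5.9, ...
b_vals <- dat$resultC10[dat$sample_ind == "B"]
recycled <- dat$resultC10 - b_vals
recycled   # same NPOC values as the output shown in the question

# 3) Grouping key: count the "B" rows seen so far, shifted down one row so
#    that each "B" stays in the same group as the rows before it.
n <- nrow(dat)
key <- c(0, cumsum(dat$sample_ind == "B")[-n])
key

# Subtract each group's last value (its "B" row) from every row in the group.
npoc <- dat$resultC10 - ave(dat$resultC10, key, FUN = function(v) v[length(v)])
npoc
```

Iterating over `seq_len(nrow(dat))` instead would make the loop visit every row, though the grouped version avoids the loop altogether.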