I am scratching my head over why my code, which clearly defines a subset of the data, is using a row outside that subset in a calculation. Here is an example. My data is all from the same batch ("batch1"), but I want to calculate NPOC using only the rows up to and including the first "B" it encounters. NPOC is calculated by subtracting that "B" row's resultC10 value from each row's resultC10 value: F row - B row, then X row - B row. All the data is from the same batch, but since I'm defining a subset, why would the code even know about the data in the rest of the batch?
library(dplyr)

dat <- data.frame(sample_ind=c("F","X","B","F","X","B"),
                  resultC=c(7.31,3.12,.79,7.38,2.28,.59),
                  batch=c('batch1','batch1','batch1','batch1','batch1','batch1'))
dat$resultC10=dat$resultC*10
dat$NPOC <- NA
start_row = 1
for (i in nrow(dat)) {
  if (dat[i,1]=='B') {
    dat[start_row:i,] <- dat[start_row:i,] %>%
      group_by(batch) %>%
      mutate(NPOC = resultC10-resultC10[sample_ind=='B']) %>%
      ungroup
    start_row = i + 1
  }
}
Here is the result I'm getting:
  sample_ind resultC  batch resultC10 NPOC
1          F    7.31 batch1      73.1 65.2  ** NPOC OK, using row 3 (73.1 - 7.9)
2          X    3.12 batch1      31.2 25.3  ** NPOC should be 23.3; it's using row 6 (31.2 - 5.9)
3          B    0.79 batch1       7.9  0.0
4          F    7.38 batch1      73.8 67.9  ** OK, using row 6
5          X    2.28 batch1      22.8 14.9  ** should be 16.9; it's using row 3
6          B    0.59 batch1       5.9  0.0
Any help is greatly appreciated.
CodePudding user response:
You can achieve this without having to use for-loops:
library(dplyr)

dat <- data.frame(sample_ind=c("F","X","B","F","X","B"),
                  resultC=c(7.31,3.12,.79,7.38,2.28,.59),
                  batch=c('batch1','batch1','batch1','batch1','batch1','batch1'))
dat$resultC10=dat$resultC*10
# Rows up to and including each "B" share a group; subtract that group's last (B) value
dat %>%
  group_by(lag(cumsum(sample_ind == "B"), default = 0)) %>%
  mutate(NPOC = resultC10-last(resultC10)) %>%
  ungroup() %>%
  select(-`lag(cumsum(sample_ind == "B"), default = 0)`)
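As for why the loop in the question touched rows outside the intended subset: `for (i in nrow(dat))` iterates over the single value 6 rather than over 1 to 6, so the body runs once with `start_row:i` equal to `1:6`, and the two "B" values are then recycled across all six rows. The sketch below reproduces that and also shows the same grouping idea as the pipeline above with the key stored in a named column. The names `loop_values`, `recycled`, `res` and `run` are chosen here for illustration only.

```r
library(dplyr)

dat <- data.frame(
  sample_ind = c("F", "X", "B", "F", "X", "B"),
  resultC    = c(7.31, 3.12, 0.79, 7.38, 2.28, 0.59),
  batch      = "batch1"
)
dat$resultC10 <- dat$resultC * 10

# nrow(dat) is one number, so this loop body runs a single time with i = 6.
loop_values <- c()
for (i in nrow(dat)) loop_values <- c(loop_values, i)

# With rows 1:6 selected there are two "B" values (7.9 and 5.9); R recycles
# them down the column, which gives the NPOC values shown in the question.
recycled <- dat$resultC10 - dat$resultC10[dat$sample_ind == "B"]

# Grouping key: count the "B" rows seen so far, shifted down one row so that
# each "B" stays in the same group as the rows before it.
res <- dat %>%
  mutate(run = lag(cumsum(sample_ind == "B"), default = 0)) %>%
  group_by(run) %>%
  mutate(NPOC = resultC10 - last(resultC10)) %>%
  ungroup() %>%
  select(-run)
```

If you would rather keep the loop, iterating over `seq_len(nrow(dat))` makes it visit every row, and the `start_row` bookkeeping then limits each calculation to one run of rows.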