I want to do a simple calculation for each row by group, but I need to refer to a previous row that meets certain conditions. I want to create a new variable, results
. For each row in each group, I want to find the closest row above where tag == "Y"
and code
is not NA
. Then, I want to use the value
from that row, and multiply by the value in the current row.
Minimal Example
df <- structure(list(name = c("apples", "apples", "apples", "apples",
"oranges", "oranges", "oranges", "oranges"),
id = 1:8,
tag = c("X", "Y", "Y", "X", "X", "Y", "X", "X"),
code = c(1, 1, NA, 1, NA, 1, NA, NA),
value = c(1, 11, 4, 3, 9, 5, 7, 8)),
class = "data.frame", row.names = c(NA, -8L))
name id tag code value
1 apples 1 X 1 1
2 apples 2 Y 1 11
3 apples 3 Y NA 4
4 apples 4 X 1 3
5 oranges 5 X NA 9
6 oranges 6 Y 1 5
7 oranges 7 X NA 7
8 oranges 8 X NA 8
Expected Output
For example, for row 3, row 2 would be the closest that meets the conditions, so multiply 4 by 11 (to get 44). For row 4, row 3 does not meet the conditions, so we go to row 2, and multiply 3 by 11 (to get 33). And so on.
name id tag code value results
1 apples 1 X 1 1 NA
2 apples 2 Y 1 11 NA
3 apples 3 Y NA 4 44
4 apples 4 X 1 3 33
5 oranges 5 X NA 9 NA
6 oranges 6 Y 1 5 NA
7 oranges 7 X NA 7 35
8 oranges 8 X NA 8 40
I am guessing that I will need to use cumsum
and/or fill
, but unsure how to use it here. I know that if I was preforming a calculation on the previous row, then I could use lag
, but unsure how to search multiple values above. I am open to base R, data.table
, tidyverse
, or other solutions.
CodePudding user response:
df %>%
group_by(name) %>%
mutate(t = na_if(lag(value * (tag == 'Y' & !is.na(code))), 0)) %>%
fill(t) %>%
mutate(results = t * value)
# A tibble: 8 x 7
# Groups: name [2]
name id tag code value t results
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 apples 1 X 1 1 NA NA
2 apples 2 Y 1 11 NA NA
3 apples 3 Y NA 4 11 44
4 apples 4 X 1 3 11 33
5 oranges 5 X NA 9 NA NA
6 oranges 6 Y 1 5 NA NA
7 oranges 7 X NA 7 5 35
8 oranges 8 X NA 8 5 40
CodePudding user response:
I'm adding a second example data set to show the impact of changing the tags (making row 3 valid for multiplying):
df2 <- df
df2$code[3] <- 1
The aim is to filter the data.frame to valid rows to use in the calculating results, join back on the original data.frame, and use fill
to propagate the last valid value. You add one to the id in the joined data.frame because that would the first ID for which the value would be valid to use. If id's are not sequential in the real data, you would need to add a dummy column with the row number.
For sake of showing impact of changing data, define the function:
computeResults <- function(data) {
left_join(
data,
data %>%
filter(tag == "Y" & !is.na(code)) %>%
mutate(id = id 1) %>%
select(name, id, prevVal = value),
by = c("name", "id"),
copy = TRUE
) %>%
group_by(name) %>%
tidyr::fill(prevVal) %>%
mutate(results = value * prevVal) %>%
select(name, id, tag, code, value, results)
}
Original recipe
computeResults(df)
#> # A tibble: 8 x 6
#> # Groups: name [2]
#> name id tag code value results
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 apples 1 X 1 1 NA
#> 2 apples 2 Y 1 11 NA
#> 3 apples 3 Y NA 4 44
#> 4 apples 4 X 1 3 33
#> 5 oranges 5 X NA 9 NA
#> 6 oranges 6 Y 1 5 NA
#> 7 oranges 7 X NA 7 35
#> 8 oranges 8 X NA 8 40
Extra crispy/row 3 changed
computeResults(df2)
#> # A tibble: 8 x 6
#> # Groups: name [2]
#> name id tag code value results
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 apples 1 X 1 1 NA
#> 2 apples 2 Y 1 11 NA
#> 3 apples 3 Y 1 4 44
#> 4 apples 4 X 1 3 12
#> 5 oranges 5 X NA 9 NA
#> 6 oranges 6 Y 1 5 NA
#> 7 oranges 7 X NA 7 35
#> 8 oranges 8 X NA 8 40
CodePudding user response:
with data.table
:
library(data.table)
setDT(df)
df[,result:=value*shift(nafill(fifelse(tag=='Y'&!is.na(code),value,NA),type = 'locf')),
by=name][]
name id tag code value result
<char> <int> <char> <num> <num> <num>
1: apples 1 X 1 1 NA
2: apples 2 Y 1 11 NA
3: apples 3 Y NA 4 44
4: apples 4 X 1 3 33
5: oranges 5 X NA 9 NA
6: oranges 6 Y 1 5 NA
7: oranges 7 X NA 7 35
8: oranges 8 X NA 8 40
CodePudding user response:
Here's a way in dplyr
, using, as you suspected, a cumulative function, cumany
.
library(dplyr)
df %>%
group_by(name) %>%
mutate(cum = cumsum(tag == "Y" & complete.cases(code))) %>%
group_by(name, cum) %>%
mutate(results = case_when(lag(cum) == cum & cum != 0 ~ value*first(value[cum == cum])))
name id tag code value cum results
<chr> <int> <chr> <dbl> <dbl> <lgl> <dbl>
1 apples 1 X 1 1 FALSE NA
2 apples 2 Y 1 11 TRUE NA
3 apples 3 Y NA 4 TRUE 44
4 apples 4 X 1 3 TRUE 33
5 oranges 5 X NA 9 FALSE NA
6 oranges 6 Y 1 5 TRUE NA
7 oranges 7 X NA 7 TRUE 35
8 oranges 8 X NA 8 TRUE 40
CodePudding user response:
I guess a base R approach could be:
df1<-df
df1$results<-NA
logi<-df1$tag=="Y" & is.na(df1$code)==FALSE
for (i in 1:length(logi)){
if(i == 1 & logi[i] == FALSE){
}else{
if(logi[i] == FALSE & logi[i-1]==TRUE & logi[i 1]==FALSE){
df1$results[i]<-df1$value[i]*df1$value[i-1]
df1$results[i 1]<-df1$value[i 1]*df1$value[i-1]
}
}
}
> df1
name id tag code value results
1 apples 1 X 1 1 NA
2 apples 2 Y 1 11 NA
3 apples 3 Y NA 4 44
4 apples 4 X 1 3 33
5 oranges 5 X NA 9 NA
6 oranges 6 Y 1 5 NA
7 oranges 7 X NA 7 35
8 oranges 8 X NA 8 40