Find closest value above current row that meets condition and apply function for each group-CodePudding

I want to do a simple calculation for each row by group, but I need to refer to a previous row that meets certain conditions. I want to create a new variable, results. For each row in each group, I want to find the closest row above where tag == "Y" and code is not NA. Then, I want to use the value from that row, and multiply by the value in the current row.

Minimal Example

df <- structure(list(name = c("apples", "apples", "apples", "apples", 
                              "oranges", "oranges", "oranges", "oranges"), 
               id = 1:8, 
               tag = c("X", "Y", "Y", "X", "X", "Y", "X", "X"), 
               code = c(1, 1, NA, 1, NA, 1, NA, NA),
               value = c(1, 11, 4, 3, 9, 5, 7, 8)), 
          class = "data.frame", row.names = c(NA, -8L))

     name id tag code value
1  apples  1   X    1     1
2  apples  2   Y    1    11
3  apples  3   Y   NA     4
4  apples  4   X    1     3
5 oranges  5   X   NA     9
6 oranges  6   Y    1     5
7 oranges  7   X   NA     7
8 oranges  8   X   NA     8

Expected Output

For example, for row 3, row 2 would be the closest that meets the conditions, so multiply 4 by 11 (to get 44). For row 4, row 3 does not meet the conditions, so we go to row 2, and multiply 3 by 11 (to get 33). And so on.

     name id tag code value results
1  apples  1   X    1     1      NA
2  apples  2   Y    1    11      NA
3  apples  3   Y   NA     4      44
4  apples  4   X    1     3      33
5 oranges  5   X   NA     9      NA
6 oranges  6   Y    1     5      NA
7 oranges  7   X   NA     7      35
8 oranges  8   X   NA     8      40

I am guessing that I will need to use cumsum and/or fill, but unsure how to use it here. I know that if I was preforming a calculation on the previous row, then I could use lag, but unsure how to search multiple values above. I am open to base R, data.table, tidyverse, or other solutions.

CodePudding user response：

df %>% 
  group_by(name) %>%
  mutate(t = na_if(lag(value * (tag == 'Y' & !is.na(code))), 0)) %>%
  fill(t) %>%
  mutate(results = t * value)

# A tibble: 8 x 7
# Groups:   name [2]
  name       id tag    code value     t results
  <chr>   <int> <chr> <dbl> <dbl> <dbl>   <dbl>
1 apples      1 X         1     1    NA      NA
2 apples      2 Y         1    11    NA      NA
3 apples      3 Y        NA     4    11      44
4 apples      4 X         1     3    11      33
5 oranges     5 X        NA     9    NA      NA
6 oranges     6 Y         1     5    NA      NA
7 oranges     7 X        NA     7     5      35
8 oranges     8 X        NA     8     5      40

CodePudding user response：

I'm adding a second example data set to show the impact of changing the tags (making row 3 valid for multiplying):

df2 <- df
df2$code[3] <- 1

The aim is to filter the data.frame to valid rows to use in the calculating results, join back on the original data.frame, and use fill to propagate the last valid value. You add one to the id in the joined data.frame because that would the first ID for which the value would be valid to use. If id's are not sequential in the real data, you would need to add a dummy column with the row number.

For sake of showing impact of changing data, define the function:

computeResults <- function(data) {
  left_join(
      data,
      data %>% 
        filter(tag == "Y" & !is.na(code)) %>% 
        mutate(id = id   1) %>% 
        select(name, id, prevVal = value),
      by = c("name", "id"),
      copy = TRUE
    ) %>% 
    group_by(name) %>% 
    tidyr::fill(prevVal) %>% 
    mutate(results = value * prevVal) %>% 
    select(name, id, tag, code, value, results)
}

Original recipe

computeResults(df)
#> # A tibble: 8 x 6
#> # Groups:   name [2]
#>   name       id tag    code value results
#>   <chr>   <dbl> <chr> <dbl> <dbl>   <dbl>
#> 1 apples      1 X         1     1      NA
#> 2 apples      2 Y         1    11      NA
#> 3 apples      3 Y        NA     4      44
#> 4 apples      4 X         1     3      33
#> 5 oranges     5 X        NA     9      NA
#> 6 oranges     6 Y         1     5      NA
#> 7 oranges     7 X        NA     7      35
#> 8 oranges     8 X        NA     8      40

Extra crispy/row 3 changed

computeResults(df2)
#> # A tibble: 8 x 6
#> # Groups:   name [2]
#>   name       id tag    code value results
#>   <chr>   <dbl> <chr> <dbl> <dbl>   <dbl>
#> 1 apples      1 X         1     1      NA
#> 2 apples      2 Y         1    11      NA
#> 3 apples      3 Y         1     4      44
#> 4 apples      4 X         1     3      12
#> 5 oranges     5 X        NA     9      NA
#> 6 oranges     6 Y         1     5      NA
#> 7 oranges     7 X        NA     7      35
#> 8 oranges     8 X        NA     8      40

CodePudding user response：

with data.table:

library(data.table)
setDT(df)

df[,result:=value*shift(nafill(fifelse(tag=='Y'&!is.na(code),value,NA),type = 'locf')), 
   by=name][]

      name    id    tag  code value result
    <char> <int> <char> <num> <num>  <num>
1:  apples     1      X     1     1     NA
2:  apples     2      Y     1    11     NA
3:  apples     3      Y    NA     4     44
4:  apples     4      X     1     3     33
5: oranges     5      X    NA     9     NA
6: oranges     6      Y     1     5     NA
7: oranges     7      X    NA     7     35
8: oranges     8      X    NA     8     40

CodePudding user response：

Here's a way in dplyr, using, as you suspected, a cumulative function, cumany.

library(dplyr)
df %>% 
  group_by(name) %>% 
  mutate(cum = cumsum(tag == "Y" & complete.cases(code))) %>% 
  group_by(name, cum) %>% 
  mutate(results = case_when(lag(cum) == cum & cum != 0 ~ value*first(value[cum == cum])))

  name       id tag    code value cum   results
  <chr>   <int> <chr> <dbl> <dbl> <lgl>   <dbl>
1 apples      1 X         1     1 FALSE      NA
2 apples      2 Y         1    11 TRUE       NA
3 apples      3 Y        NA     4 TRUE       44
4 apples      4 X         1     3 TRUE       33
5 oranges     5 X        NA     9 FALSE      NA
6 oranges     6 Y         1     5 TRUE       NA
7 oranges     7 X        NA     7 TRUE       35
8 oranges     8 X        NA     8 TRUE       40

CodePudding user response：

I guess a base R approach could be:

df1<-df
df1$results<-NA

logi<-df1$tag=="Y" & is.na(df1$code)==FALSE

for (i in 1:length(logi)){
  
  if(i == 1 & logi[i] == FALSE){
    }else{
      
      if(logi[i] == FALSE & logi[i-1]==TRUE & logi[i 1]==FALSE){
        
        df1$results[i]<-df1$value[i]*df1$value[i-1]
        
        df1$results[i 1]<-df1$value[i 1]*df1$value[i-1]
      }
    }
  }


> df1
     name id tag code value results
1  apples  1   X    1     1      NA
2  apples  2   Y    1    11      NA
3  apples  3   Y   NA     4      44
4  apples  4   X    1     3      33
5 oranges  5   X   NA     9      NA
6 oranges  6   Y    1     5      NA
7 oranges  7   X   NA     7      35
8 oranges  8   X   NA     8      40