Assign Number to Column Based on Index of Next Value in 2nd Column

I have the following dataframe:

df <- data.frame(t = c("h","h","h","a","a","h","a","a","h","a","h","a","a"), time = c(1,1,1,1,1,1,1,1,1,1,1,2,2), 
                 key = c("no", "no", "no","yes","no","no","no","no","yes","yes","no","no","no"), 
                 expected = c(-1,-1,-1,1,-1,1,-1,-1,1,1,0,0,0))

   t time key expected myTest1 myTest2 myTest3
1  h    1  no       -1      -1       1       1
2  h    1  no       -1      -1       1       1
3  h    1  no       -1      -1       1       1
4  a    1 yes        1       1       1      -1
5  a    1  no       -1       1       0      -1
6  h    1  no        1      -1       0       1
7  a    1  no       -1       1       0      -1
8  a    1  no       -1       1       0      -1
9  h    1 yes        1      -1       0       1
10 a    1 yes        1       1       1      -1
11 h    1  no        0      -1       0       1
12 a    2  no        0       1       0      -1
13 a    2  no        0       1       0      -1

I am trying to create a column that matches expected. Within each time group, every row where key is "yes" gets 1. The other rows follow these rules:

If the next "yes" row in key has "h" in t, every row before it (back to the previous "yes" row) gets 1 if its t is "h" and -1 if its t is "a"
If the next "yes" row in key has "a" in t, every row before it (back to the previous "yes" row) gets 1 if its t is "a" and -1 if its t is "h"
If there are no more "yes" rows in the current time group, the row gets 0

I first tried using a nested for loop:

df$myTest1 <- 0
testIdx <- which(df$key %in% "yes")
df$myTest1[testIdx] <- 1
for (i in 1:length(testIdx)) {
  for (j in 1:nrow(df)) {
    df$myTest1[j] <- ifelse(df$t[testIdx[i]] == "h" & df$t[j] == "h", 1,
                                  ifelse(df$t[testIdx[i]] == "h" & df$t[j] == "a", -1,
                                         ifelse(df$t[testIdx[i]] == "a" & df$t[j] == "h",
                                                -1, ifelse(df$t[testIdx[i]] == "a" & 
                                                             df$t[j] == "a", 1, 0))))
  }
}

This gets the correct values in myTest1 up to and including the first "yes" row, but not for the rows after it.

I also tried two more methods to create myTest2 and myTest3:

df$myTest2 <- cumsum(c(1, head(df$key == "yes", -1))) %% 2

library(dplyr)  # provides %>%, mutate() and case_when()

df <- df %>%
  mutate(myTest3 = case_when(t == "h" ~ 1, #add if next "yes" is also "h" condition
                               t == "a" ~ -1,
                               TRUE ~ 0))

Using case_when() is similar to the nested ifelse() calls, but I can't figure out how to add the other conditions without for loops.

For clarification, the expected column reads as it does because the first "yes" (row 4) is on a row with "a", so all prior "h" rows get -1 while all prior "a" rows get 1. The next "yes" row (row 9) has "h" in it, so the rows between the two "yes" rows get 1 for "h" and -1 for "a". Row 10 contains a "yes" and comes right after another "yes" row, so it just gets 1. Row 11 is the last row with time = 1 and has no "yes" rows after it, so it is assigned 0. There are no "yes" rows when time = 2, so all rows there also receive 0.

CodePudding user response:

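This may help you.

The magic happens in the na.locf function from the zoo package: called with fromLast = TRUE, it fills each NA with the next non-NA value, so every row ends up carrying the t of the next "yes" row in its time block.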
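
To see what that call does in isolation, here is a minimal sketch on a made-up vector (the values are illustrative, not taken from the question's data). It assumes the zoo package is installed and relies on na.locf's default na.rm = TRUE.

```r
library(zoo)

# "yes" rows carry their t value; every other row is NA (illustrative values)
tomatch <- c(NA, NA, "a", NA, "h", NA)

# fromLast = TRUE carries each value backwards over the NAs that precede it.
# With the default na.rm = TRUE, NAs that cannot be filled are dropped,
# here the trailing NA after the last value.
filled <- zoo::na.locf(tomatch, fromLast = TRUE)
print(filled)
# "a" "a" "a" "h" "h"
```

The result is one element shorter than the input. That is why the function below only overwrites the first length(tomatch) rows and leaves the rows after the last "yes" at their default of 0.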

library(magrittr)
library(zoo)

doblock <- function(timeblock) {
  yesrows <- which(timeblock$key == "yes")
  if (length(yesrows) == 0) {
    # no yes rows in timeblock: make all 0
    timeblock$exp2 <- 0
  } else {
    # create a vector of a's and h's against which we need to match the t field
    tomatch <- rep(NA, nrow(timeblock))
    tomatch[yesrows] <- as.character(timeblock$t)[yesrows]
    
    tomatch <- zoo::na.locf(tomatch, fromLast = TRUE)
    
    # now do the matching
    timeblock$exp2 <- 0    # set default as 0 (for those entries after the last 'yes')
    rows <- seq_along(tomatch)
    # 1 where a row's t equals the t of its next 'yes' row, otherwise -1
    timeblock$exp2[rows] <-
      ifelse(as.character(timeblock$t[rows]) == tomatch, 1, -1)
  }
  
  timeblock
} 

# split dataframe into blocks for each 'time' and apply function to every time-block
newdf <- 
  lapply(split(df, df$time), doblock) %>% 
  do.call(rbind, .)

The result looks like this, where exp2 is the output of the function above. It matches your expected column.

     t time key expected exp2
1.1  h    1  no       -1   -1
1.2  h    1  no       -1   -1
1.3  h    1  no       -1   -1
1.4  a    1 yes        1    1
1.5  a    1  no       -1   -1
1.6  h    1  no        1    1
1.7  a    1  no       -1   -1
1.8  a    1  no       -1   -1
1.9  h    1 yes        1    1
1.10 a    1 yes        1    1
1.11 h    1  no        0    0
2.12 a    2  no        0    0
2.13 a    2  no        0    0
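
Since the question already uses dplyr, the same idea can probably also be written with tidyr::fill() in place of na.locf(). This is an alternative sketch, not the code from the answer above, and it assumes dplyr and tidyr are installed. The helper column next_t and the output column exp3 are names made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  t = c("h","h","h","a","a","h","a","a","h","a","h","a","a"),
  time = c(1,1,1,1,1,1,1,1,1,1,1,2,2),
  key = c("no","no","no","yes","no","no","no","no","yes","yes","no","no","no"),
  expected = c(-1,-1,-1,1,-1,1,-1,-1,1,1,0,0,0)
)

result <- df %>%
  group_by(time) %>%
  # keep t only on "yes" rows; as.character() guards against t being a factor
  mutate(next_t = if_else(key == "yes", as.character(t), NA_character_)) %>%
  # fill upwards within each time group, so every row sees the next "yes" row's t
  fill(next_t, .direction = "up") %>%
  mutate(exp3 = case_when(
    is.na(next_t) ~ 0,                  # no later "yes" in this time group
    as.character(t) == next_t ~ 1,      # same t as the next "yes" row
    TRUE ~ -1                           # different t from the next "yes" row
  )) %>%
  ungroup() %>%
  select(-next_t)

# TRUE if the new column reproduces the expected column
print(isTRUE(all.equal(result$exp3, df$expected)))
```

Because fill() respects the grouping, rows in a time group without any "yes" keep next_t as NA and fall through to the 0 branch, so no separate check for empty groups is needed.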