I have the following dataframe:
df <- data.frame(t = c("h","h","h","a","a","h","a","a","h","a","h","a","a"), time = c(1,1,1,1,1,1,1,1,1,1,1,2,2),
key = c("no", "no", "no","yes","no","no","no","no","yes","yes","no","no","no"),
expected = c(-1,-1,-1,1,-1,1,-1,-1,1,1,0,0,0))
   t time key expected myTest1 myTest2 myTest3
1  h    1  no       -1      -1       1       1
2  h    1  no       -1      -1       1       1
3  h    1  no       -1      -1       1       1
4  a    1 yes        1       1       1      -1
5  a    1  no       -1       1       0      -1
6  h    1  no        1      -1       0       1
7  a    1  no       -1       1       0      -1
8  a    1  no       -1       1       0      -1
9  h    1 yes        1      -1       0       1
10 a    1 yes        1       1       1      -1
11 h    1  no        0      -1       0       1
12 a    2  no        0       1       0      -1
13 a    2  no        0       1       0      -1
I am trying to recreate a column similar to expected. Grouped by the time column, the first condition is to assign 1 in each row where key is "yes". The other conditions are:
- If the row that contains the next "yes" in key also contains "h" in t, assign 1 up until the "yes" row for each row that has "h", and -1 for rows that have "a"
- If the row that contains the next "yes" in key also contains "a" in t, assign 1 up until the "yes" row for each row that has "a", and -1 for rows that have "h"
- If there are no more "yes" rows in each time section, assign a 0 to that row
I first tried using a nested for loop:
df$myTest1 <- 0
testIdx <- which(df$key %in% "yes")
df$myTest1[testIdx] <- 1
for (i in 1:length(testIdx)) {
  for (j in 1:nrow(df)) {
    df$myTest1[j] <- ifelse(df$t[testIdx[i]] == "h" & df$t[j] == "h", 1,
                     ifelse(df$t[testIdx[i]] == "h" & df$t[j] == "a", -1,
                     ifelse(df$t[testIdx[i]] == "a" & df$t[j] == "h", -1,
                     ifelse(df$t[testIdx[i]] == "a" & df$t[j] == "a", 1, 0))))
  }
}
This gets the correct values in myTest1 up to and including the first "yes", but every row after that is incorrect.
I also tried two more methods to create myTest2 and myTest3:
df$myTest2 <- cumsum(c(1, head(df$key == "yes", -1))) %% 2
library(dplyr) # needed for %>%, mutate() and case_when()
df <- df %>%
  mutate(myTest3 = case_when(t == "h" ~ 1, #add if next "yes" is also "h" condition
                             t == "a" ~ -1,
                             TRUE ~ 0))
Using case_when() is similar to ifelse(), but I can't figure out how to add the other conditions without for loops.
For clarification, the expected column reads as it does because the first "yes" belongs to a row with "a", so all prior "h" rows get -1 while all prior "a" rows get 1. The next "yes" row has "h" in it, so the rows between the two "yes" rows get 1 for "h" and -1 for "a". Row 10 contains a "yes" and comes right after a "yes", so it just gets a 1. Row 11 is the last row with time = 1 and has no "yes" rows after it, so it's assigned 0. There are no "yes" rows when time = 2, so all rows there also receive 0.
CodePudding user response:
This may help you.
The magic happens in the na.locf function from the zoo package.
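As a small illustration of that step on its own, here is a sketch using a made-up toy vector (not part of your data). It assumes the zoo package is installed. The t value is known only at the "yes" positions, and na.locf with fromLast = TRUE copies each value backwards over the NAs in front of it.

```r
library(zoo)

# Hypothetical toy vector: t is known only where key == "yes"
tomatch <- c(NA, NA, "a", NA, "h", NA)

# fromLast = TRUE fills each NA with the next non-NA value.
# na.rm defaults to TRUE, so the trailing NA (which has no later
# value to copy) is dropped and the result is one element shorter.
filled <- zoo::na.locf(tomatch, fromLast = TRUE)
print(filled)
```

Because the trailing NAs are dropped, the filled vector stops at the last "yes" row, which is why the full function below only overwrites the first length(tomatch) rows and leaves the rest at 0.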
library(magrittr)
library(zoo)
doblock <- function(timeblock) {
  yesrows <- which(timeblock$key == "yes")
  if (length(yesrows) == 0) {
    # no yes rows in timeblock: make all 0
    timeblock$exp2 <- 0
  } else {
    # create a vector of a's and h's against which we need to match the t field
    tomatch <- rep(NA, nrow(timeblock))
    tomatch[yesrows] <- as.character(timeblock$t)[yesrows]
    # fill backwards from each 'yes' row; trailing NAs are dropped (na.rm = TRUE
    # by default), so tomatch ends at the last 'yes' row
    tomatch <- zoo::na.locf(tomatch, fromLast = TRUE)
    # now do the matching
    timeblock$exp2 <- 0 # set default as 0 (for those entries after the last 'yes')
    timeblock$exp2[1:length(tomatch)] <-
      mapply(function(t1, t2) {
        if ((t1) == t2) 1 else -1
      }, as.character(timeblock$t[1:length(tomatch)]), tomatch)
  }
  timeblock
}
# split dataframe into blocks for each 'time' and apply function to every time-block
newdf <-
  lapply(split(df, df$time), doblock) %>%
  do.call(rbind, .)
The result looks like this, where exp2 is the output of the above function. It matches your expected field.
     t time key expected exp2
1.1  h    1  no       -1   -1
1.2  h    1  no       -1   -1
1.3  h    1  no       -1   -1
1.4  a    1 yes        1    1
1.5  a    1  no       -1   -1
1.6  h    1  no        1    1
1.7  a    1  no       -1   -1
1.8  a    1  no       -1   -1
1.9  h    1 yes        1    1
1.10 a    1 yes        1    1
1.11 h    1  no        0    0
2.12 a    2  no        0    0
2.13 a    2  no        0    0
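If you would rather stay in a dplyr pipeline, the same idea can probably be written with tidyr::fill in place of na.locf. This is only a sketch, assuming dplyr and tidyr are installed. The helper column next_t and the output column exp3 are names made up for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  t = c("h","h","h","a","a","h","a","a","h","a","h","a","a"),
  time = c(1,1,1,1,1,1,1,1,1,1,1,2,2),
  key = c("no","no","no","yes","no","no","no","no","yes","yes","no","no","no"),
  expected = c(-1,-1,-1,1,-1,1,-1,-1,1,1,0,0,0),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(time) %>%
  # next_t (hypothetical helper) holds t only on the "yes" rows, NA elsewhere
  mutate(next_t = if_else(key == "yes", t, NA_character_)) %>%
  # fill upwards within each time group, so every row sees the t of the next "yes"
  fill(next_t, .direction = "up") %>%
  mutate(exp3 = case_when(
    is.na(next_t) ~ 0,   # no further "yes" in this time group
    t == next_t   ~ 1,   # same t as the next "yes" row
    TRUE          ~ -1   # different t from the next "yes" row
  )) %>%
  ungroup() %>%
  select(-next_t)

# check against the expected column from the question
all(result$exp3 == result$expected)
```

The grouping does the job of split() and lapply() above: fill() never carries a value across time groups, so rows with no later "yes" keep an NA in next_t and end up as 0.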