I have a dataframe as follows:
ID | Col1 | RespID | Col3 | Col4 |
---|---|---|---|---|
1 | blue | 729Ad | 3.2 | A |
2 | orange | 295gS | 6.5 | A |
3 | red | 729Ad | 8.4 | B |
4 | yellow | 592Jd | 2.9 | A |
5 | green | 937sa | 3.5 | B |
I would like to calculate a new column, Col5, such that its value is 1 if the row has Col4 value of A and there exists another column somewhere in the dataset a row with the same RespId but a Col4 value of B. Otherwise it’s value is 0. Then I will drop all rows with Col4 value of B, to keep just those with A. I would like to do this using the R tidymodels recipe package. I’d also like to do this all with data frames.
Here is what the desired output table would look like prior to dropping rows with Col4 value of B:
ID | Col1 | RespID | Col3 | Col4 | Col5 |
---|---|---|---|---|---|
1 | blue | 729Ad | 3.2 | A | 1 |
2 | orange | 295gS | 6.5 | A | 0 |
3 | red | 729Ad | 8.4 | B | 0 |
4 | yellow | 592Jd | 2.9 | A | 0 |
5 | green | 937sa | 3.5 | B | 0 |
CodePudding user response:
Does this work:
library(dplyr)
df %>% group_by(RespID) %>% mutate(col5 = case_when(Col4 == 'A' & last(Col4 == 'B') ~ 1, Col4 == 'B' & first(Col4 == 'B') ~ 0, TRUE ~ 0))
# A tibble: 5 x 6
# Groups: RespID [4]
ID Col1 RespID Col3 Col4 col5
<int> <chr> <chr> <dbl> <chr> <dbl>
1 1 blue 729Ad 3.2 A 1
2 2 orange 295gS 6.5 A 0
3 3 red 729Ad 8.4 B 0
4 4 yellow 592Jd 2.9 A 0
5 5 green 937sa 3.5 B 0
CodePudding user response:
You may try -
library(dplyr)
df %>%
group_by(RespID) %>%
mutate(Col5 = as.integer(all(c('A', 'B') %in% Col4) & Col4 == 'A')) %>%
ungroup
# ID Col1 RespID Col3 Col4 Col5
# <int> <chr> <chr> <dbl> <chr> <int>
#1 1 blue 729Ad 3.2 A 1
#2 2 orange 295gS 6.5 A 0
#3 3 red 729Ad 8.4 B 0
#4 4 yellow 592Jd 2.9 A 0
#5 5 green 937sa 3.5 B 0
all(c('A', 'B') %in% Col4)
checks that both A
and B
value is present in RespID
whereas Col4 == 'A'
would give 1 only where 'A'
is present.
data
df <- structure(list(ID = 1:5, Col1 = c("blue", "orange", "red", "yellow",
"green"), RespID = c("729Ad", "295gS", "729Ad", "592Jd", "937sa"
), Col3 = c(3.2, 6.5, 8.4, 2.9, 3.5), Col4 = c("A", "A", "B",
"A", "B")), row.names = c(NA, -5L), class = "data.frame")