I have been struggling to create a new column based on the values of certain rows and columns in another dataframe. I have a data frame that looks like this:
sample_ID | date | test_result |
---|---|---|
sample1 | 1/1/2022 | positive |
sample1 | 1/1/2022 | negative |
sample1 | 1/1/2022 | negative |
sample2 | 2/1/2022 | positive |
sample2 | 3/1/2022 | negative |
sample3 | 4/1/2022 | negative |
sample3 | 5/1/2022 | positive |
sample4 | 5/1/2022 | negative |
sample4 | 6/1/2022 | negative |
sample4 | 7/1/2022 | negative |
I want to create a new column with a decision of the final result of each sample ID. If sample is positive at any date, the final result will be the test result of the earliest date of positivity, otherwise the sample is negative. The results should look like this:
sample_ID | date | test_result | final_result |
---|---|---|---|
sample1 | 1/1/2022 | positive | positive |
sample1 | 1/1/2022 | negative | positive |
sample1 | 1/1/2022 | negative | positive |
sample2 | 2/1/2022 | positive | positive |
sample2 | 3/1/2022 | negative | positive |
sample3 | 4/1/2022 | negative | positive |
sample3 | 5/1/2022 | positive | positive |
sample4 | 5/1/2022 | negative | negative |
sample4 | 6/1/2022 | negative | negative |
sample4 | 7/1/2022 | negative | negative |
I did try with ifelse and loop but it was not successful. I would appreciate any help. Thank you very much.
CodePudding user response:
You can use the library dplyr for this, doing a group_by on sample_ID and checking if any of the corresponding values was positive:
library(dplyr)
data = "sample_ID date test_result
sample1 1/1/2022 positive
sample1 1/1/2022 negative
sample1 1/1/2022 negative
sample2 2/1/2022 positive
sample2 3/1/2022 negative
sample3 4/1/2022 negative
sample3 5/1/2022 positive
sample4 5/1/2022 negative
sample4 6/1/2022 negative
sample4 7/1/2022 negative"
df <- read.table(text=data, sep="\t", header = TRUE)
df %>%
group_by(sample_ID) %>%
mutate(final_result = ifelse(any(test_result == 'positive'),'positive','negative')) %>%
ungroup()
CodePudding user response:
You can group_by
sample_ID
and check whether any of the test_result
is "positive"
:
library(dplyr)
df %>%
group_by(sample_ID) %>%
mutate(final_result = ifelse(any(test_result == "positive"), "positive", "negative"))
output
# A tibble: 10 × 4
# Groups: sample_ID [4]
sample_ID date test_result final_result
<chr> <chr> <chr> <chr>
1 sample1 1/1/2022 positive positive
2 sample1 1/1/2022 negative positive
3 sample1 1/1/2022 negative positive
4 sample2 2/1/2022 positive positive
5 sample2 3/1/2022 negative positive
6 sample3 4/1/2022 negative positive
7 sample3 5/1/2022 positive positive
8 sample4 5/1/2022 negative negative
9 sample4 6/1/2022 negative negative
10 sample4 7/1/2022 negative negative
An equivalent in base R is:
df |>
transform(final_result = ave(test_result, sample_ID,
FUN = \(x) ifelse(any(x == "positive"), "positive", "negative")))