Create a new column based on existing multiple columns and rows in R-CodePudding

I have been struggling to create a new column based on the values of certain rows and columns in another dataframe. I have a data frame that looks like this:

sample_ID	date	test_result
sample1	1/1/2022	positive
sample1	1/1/2022	negative
sample1	1/1/2022	negative
sample2	2/1/2022	positive
sample2	3/1/2022	negative
sample3	4/1/2022	negative
sample3	5/1/2022	positive
sample4	5/1/2022	negative
sample4	6/1/2022	negative
sample4	7/1/2022	negative

I want to create a new column with a decision of the final result of each sample ID. If sample is positive at any date, the final result will be the test result of the earliest date of positivity, otherwise the sample is negative. The results should look like this:

sample_ID	date	test_result	final_result
sample1	1/1/2022	positive	positive
sample1	1/1/2022	negative	positive
sample1	1/1/2022	negative	positive
sample2	2/1/2022	positive	positive
sample2	3/1/2022	negative	positive
sample3	4/1/2022	negative	positive
sample3	5/1/2022	positive	positive
sample4	5/1/2022	negative	negative
sample4	6/1/2022	negative	negative
sample4	7/1/2022	negative	negative

I did try with ifelse and loop but it was not successful. I would appreciate any help. Thank you very much.

CodePudding user response：

You can use the library dplyr for this, doing a group_by on sample_ID and checking if any of the corresponding values was positive:

library(dplyr)

data = "sample_ID   date    test_result
                 sample1    1/1/2022    positive
                 sample1    1/1/2022    negative
                 sample1    1/1/2022    negative
                 sample2    2/1/2022    positive
                 sample2    3/1/2022    negative
                 sample3    4/1/2022    negative
                 sample3    5/1/2022    positive
                 sample4    5/1/2022    negative
                 sample4    6/1/2022    negative
                 sample4    7/1/2022    negative"

df <- read.table(text=data, sep="\t", header = TRUE)

df %>% 
  group_by(sample_ID) %>% 
  mutate(final_result = ifelse(any(test_result == 'positive'),'positive','negative')) %>% 
  ungroup()

CodePudding user response：

You can group_by sample_ID and check whether any of the test_result is "positive":

library(dplyr)
df %>% 
  group_by(sample_ID) %>% 
  mutate(final_result = ifelse(any(test_result == "positive"), "positive", "negative"))

output

# A tibble: 10 × 4
# Groups:   sample_ID [4]
   sample_ID date     test_result final_result
   <chr>     <chr>    <chr>       <chr>       
 1 sample1   1/1/2022 positive    positive    
 2 sample1   1/1/2022 negative    positive    
 3 sample1   1/1/2022 negative    positive    
 4 sample2   2/1/2022 positive    positive    
 5 sample2   3/1/2022 negative    positive    
 6 sample3   4/1/2022 negative    positive    
 7 sample3   5/1/2022 positive    positive    
 8 sample4   5/1/2022 negative    negative    
 9 sample4   6/1/2022 negative    negative    
10 sample4   7/1/2022 negative    negative

An equivalent in base R is:

df |>
  transform(final_result = ave(test_result, sample_ID, 
                               FUN = \(x) ifelse(any(x == "positive"), "positive", "negative")))