Home > front end >  Create a new column based on existing multiple columns and rows in R
Create a new column based on existing multiple columns and rows in R

Time:11-04

I have been struggling to create a new column based on the values of certain rows and columns in another dataframe. I have a data frame that looks like this:

sample_ID date test_result
sample1 1/1/2022 positive
sample1 1/1/2022 negative
sample1 1/1/2022 negative
sample2 2/1/2022 positive
sample2 3/1/2022 negative
sample3 4/1/2022 negative
sample3 5/1/2022 positive
sample4 5/1/2022 negative
sample4 6/1/2022 negative
sample4 7/1/2022 negative

I want to create a new column with a decision of the final result of each sample ID. If sample is positive at any date, the final result will be the test result of the earliest date of positivity, otherwise the sample is negative. The results should look like this:

sample_ID date test_result final_result
sample1 1/1/2022 positive positive
sample1 1/1/2022 negative positive
sample1 1/1/2022 negative positive
sample2 2/1/2022 positive positive
sample2 3/1/2022 negative positive
sample3 4/1/2022 negative positive
sample3 5/1/2022 positive positive
sample4 5/1/2022 negative negative
sample4 6/1/2022 negative negative
sample4 7/1/2022 negative negative

I did try with ifelse and loop but it was not successful. I would appreciate any help. Thank you very much.

CodePudding user response:

You can use the library dplyr for this, doing a group_by on sample_ID and checking if any of the corresponding values was positive:

library(dplyr)

data = "sample_ID   date    test_result
                 sample1    1/1/2022    positive
                 sample1    1/1/2022    negative
                 sample1    1/1/2022    negative
                 sample2    2/1/2022    positive
                 sample2    3/1/2022    negative
                 sample3    4/1/2022    negative
                 sample3    5/1/2022    positive
                 sample4    5/1/2022    negative
                 sample4    6/1/2022    negative
                 sample4    7/1/2022    negative"

df <- read.table(text=data, sep="\t", header = TRUE)

df %>% 
  group_by(sample_ID) %>% 
  mutate(final_result = ifelse(any(test_result == 'positive'),'positive','negative')) %>% 
  ungroup()

CodePudding user response:

You can group_by sample_ID and check whether any of the test_result is "positive":

library(dplyr)
df %>% 
  group_by(sample_ID) %>% 
  mutate(final_result = ifelse(any(test_result == "positive"), "positive", "negative"))

output

# A tibble: 10 × 4
# Groups:   sample_ID [4]
   sample_ID date     test_result final_result
   <chr>     <chr>    <chr>       <chr>       
 1 sample1   1/1/2022 positive    positive    
 2 sample1   1/1/2022 negative    positive    
 3 sample1   1/1/2022 negative    positive    
 4 sample2   2/1/2022 positive    positive    
 5 sample2   3/1/2022 negative    positive    
 6 sample3   4/1/2022 negative    positive    
 7 sample3   5/1/2022 positive    positive    
 8 sample4   5/1/2022 negative    negative    
 9 sample4   6/1/2022 negative    negative    
10 sample4   7/1/2022 negative    negative    

An equivalent in base R is:

df |>
  transform(final_result = ave(test_result, sample_ID, 
                               FUN = \(x) ifelse(any(x == "positive"), "positive", "negative")))
  • Related