Home > Software design >  Check to see if value from one column is present in two other columns in one dataframe R
Check to see if value from one column is present in two other columns in one dataframe R

Time:12-02

I'd like to figure out a way to compare columns in the SAME data frame, but in such a way that I create a new column called STATUS for the output. I have 3 columns 1)SNPs, 2)gained, and 3)lost. I want to know if the data in each cell in column 1 is present in either column 2 or 3. If the data from column 1 is present in column 2 then I would want the output to say GAINED, and if its present in column 3 then the output would be LOST. If it's present in either then the output will be NEUTRAL.

Here is what I would like:

SNPs         GAINED          LOST           STATUS
1_752566     1_949654        6_30022061     NEUTRAL
1_776546     1_1045331       6_30314321     NEUTRAL
1_832918     1_832918        13_95612033    GAINED
1_914852     1_1247494       1_914852       LOST

I've tried this:

data_frame$status <- data.frame(lapply(data_frame[1], `%in%`, data_frame[2:3]))

but it produces 2 columns that all say NEUTRAL. I believe it's reading per row to see if it matches, but my data isn't organized in that manner such that it will find every match per row. Instead I'd like to search the entire column and have R find the matches in each cell instead of searching per row.

CodePudding user response:

You don't need lapply or anything fancy like that.

data_frame$STATUS = with(data_frame,
  ifelse(SNPs %in% GAINED, "GAINED",
   ifelse(SNPs %in% LOST, "LOST", "NEUTRAL")
  )
)

Note that the way this is written the GAINED condition is checked first so if it is present in both GAINED and LOST the result will be "GAINED".

CodePudding user response:

Using a nested ifelse should work, and be fairly understandable if indented properly:

tbl$status <- ifelse(tbl$SNPs %in% tbl$GAINED, "GAINED",
                               ifelse(tbl$SNPs %in% tbl$LOST, "LOST", "NEUTRAL") )

> tbl
      SNPs    GAINED        LOST  STATUS  status
1 1_752566  1_949654  6_30022061 NEUTRAL NEUTRAL
2 1_776546 1_1045331  6_30314321 NEUTRAL NEUTRAL
3 1_832918  1_832918 13_95612033  GAINED  GAINED
4 1_914852 1_1247494    1_914852    LOST    LOST

CodePudding user response:

A Tidyverse approach with case_when

library(tidyverse)

df <-
  structure(
    list(
      SNPs = c("1_752566", "1_776546", "1_832918", "1_914852"),
      GAINED = c("1_949654", "1_1045331", "1_832918", "1_1247494"),
      LOST = c("6_30022061", "6_30314321", "13_95612033", "1_914852")
    ),
    row.names = c(NA,-4L),
    spec = structure(list(
      cols = list(
        SNPs = structure(list(), class = c("collector_character",
                                           "collector")),
        GAINED = structure(list(), class = c("collector_character",
                                             "collector")),
        LOST = structure(list(), class = c("collector_character",
                                           "collector"))
      ),
      default = structure(list(), class = c("collector_guess",
                                            "collector")),
      delim = ","
    ), class = "col_spec"),
    class = c("spec_tbl_df",
              "tbl_df", "tbl", "data.frame")
  )

df %>%
  mutate(STATUS = case_when(
    SNPs %in% GAINED ~ 'GAINED',
    SNPs %in% LOST ~ 'LOST',
    TRUE ~ 'NEUTRAL'
  ))
#> # A tibble: 4 × 4
#>   SNPs     GAINED    LOST        STATUS 
#>   <chr>    <chr>     <chr>       <chr>  
#> 1 1_752566 1_949654  6_30022061  NEUTRAL
#> 2 1_776546 1_1045331 6_30314321  NEUTRAL
#> 3 1_832918 1_832918  13_95612033 GAINED 
#> 4 1_914852 1_1247494 1_914852    LOST

Created on 2022-12-01 with reprex v2.0.2

  • Related