Compare data frame multiple columns with row in R

Time:03-31

I have a data frame with multiple columns and rows. I want to compare each value in column 7 with the headers of columns 1, 2, 4 and 5 and, where they match, copy the sequence from that column into a new column. The common pattern between the column headers and the values in column 7 is the suffix .x or .y.

My data frame looks like this

    F_20TP53_Seq.x  F_30TP53_Seq.x  R_20TP53_Seq.x  F_20TP53_Seq.y  F_30TP53_Seq.y  R_20TP53_Seq.y  Name_of_F_TP53
    CACTGT  CAAAGT  CATAGT  AATGTTG CACAGT  CAAAGT  F_20TP53_Max_score.y
    CACAGT  CACTGT  CACAGT  CCAAGG  CATAGT  CACTGT  F_30TP53_Max_score.y
    CATAGT  AATGTTG CACAG   GCCAGG  CACAGT  CACTGT  F_20TP53_Max_score.x
    CACAGT  CCAAGG  CACCAT  CAAAGT  CACAG   CACAGT  F_30TP53_Max_score.x
    CACTGT  CACAGT  CCAAGG  CACTGT  CACCAT  CATAGT  F_30TP53_Max_score.y
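
To make the example reproducible, the table above could be built roughly like this. The object name d is borrowed from the answer's code, the values are copied from the rows shown, and storing the sequences as character rather than factor is an assumption:

```r
# Rebuild the example table; values are copied from the rows shown above.
# stringsAsFactors = FALSE (character columns) is an assumption.
d <- data.frame(
  F_20TP53_Seq.x = c("CACTGT", "CACAGT", "CATAGT", "CACAGT", "CACTGT"),
  F_30TP53_Seq.x = c("CAAAGT", "CACTGT", "AATGTTG", "CCAAGG", "CACAGT"),
  R_20TP53_Seq.x = c("CATAGT", "CACAGT", "CACAG", "CACCAT", "CCAAGG"),
  F_20TP53_Seq.y = c("AATGTTG", "CCAAGG", "GCCAGG", "CAAAGT", "CACTGT"),
  F_30TP53_Seq.y = c("CACAGT", "CATAGT", "CACAGT", "CACAG", "CACCAT"),
  R_20TP53_Seq.y = c("CAAAGT", "CACTGT", "CACTGT", "CACAGT", "CATAGT"),
  Name_of_F_TP53 = c("F_20TP53_Max_score.y", "F_30TP53_Max_score.y",
                     "F_20TP53_Max_score.x", "F_30TP53_Max_score.x",
                     "F_30TP53_Max_score.y"),
  stringsAsFactors = FALSE
)
```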

And my expected output is like this

    F_20TP53_Seq.x  F_30TP53_Seq.x  R_20TP53_Seq.x  F_20TP53_Seq.y  F_30TP53_Seq.y  R_20TP53_Seq.y  Name_of_F_TP53  F_20TP53_Seq.x  F_30TP53_Seq.x  F_20TP53_Seq.y  F_30TP53_Seq.y
    CACTGT  CAAAGT  CATAGT  AATGTTG CACAGT  CAAAGT  F_20TP53_Max_score.y    NA  NA  AATGTTG CACAGT
    CACAGT  CACTGT  CACAGT  CCAAGG  CATAGT  CACTGT  F_30TP53_Max_score.y    NA  NA  CCAAGG  CATAGT
    CATAGT  AATGTTG CACAG   GCCAGG  CACAGT  CACTGT  F_20TP53_Max_score.x    CATAGT  AATGTTG NA  NA
    CACAGT  CCAAGG  CACCAT  CAAAGT  CACAG   CACAGT  F_30TP53_Max_score.x    CACAGT  CCAAGG  NA  NA
    CACTGT  CACAGT  CCAAGG  CACTGT  CACCAT  CATAGT  F_30TP53_Max_score.y    NA  NA  CACTGT  CACCAT

CodePudding user response:

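I use the stringr package below to build, for each target column, a logical vector indicating whether the column's suffix (x or y) matches the end of the value in Name_of_F_TP53. Rows that match keep their sequence; the rest become NA.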

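    library(stringr)

    cols <- c(1, 2, 4, 5)  # sequence columns to compare against column 7

    cbind(
      d,
      setNames(
        lapply(cols, function(x) {
          # Regex for this column's suffix, anchored at the end, e.g. "x$"
          key <- paste0(str_extract(colnames(d)[x], "[xy]$"), "$")
          # TRUE for rows whose Name_of_F_TP53 ends with the same suffix
          k <- str_detect(d$Name_of_F_TP53, key)
          # Keep the sequence where the suffix matches, NA elsewhere
          ifelse(k, as.character(d[[x]]), NA_character_)
        }),
        colnames(d)[cols]
      )
    )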

Output:

      F_20TP53_Seq.x F_30TP53_Seq.x R_20TP53_Seq.x F_20TP53_Seq.y F_30TP53_Seq.y R_20TP53_Seq.y         Name_of_F_TP53
    1         CACTGT         CAAAGT         CATAGT        AATGTTG         CACAGT         CAAAGT   F_20TP53_Max_score.y
    2         CACAGT         CACTGT         CACAGT         CCAAGG         CATAGT         CACTGT   F_30TP53_Max_score.y
    3         CATAGT        AATGTTG          CACAG         GCCAGG         CACAGT         CACTGT   F_20TP53_Max_score.x
    4         CACAGT         CCAAGG         CACCAT         CAAAGT          CACAG         CACAGT   F_30TP53_Max_score.x
    5         CACTGT         CACAGT         CCAAGG         CACTGT         CACCAT         CATAGT   F_30TP53_Max_score.y
      F_20TP53_Seq.x F_30TP53_Seq.x F_20TP53_Seq.y F_30TP53_Seq.y
    1           <NA>           <NA>        AATGTTG         CACAGT
    2           <NA>           <NA>         CCAAGG         CATAGT
    3         CATAGT        AATGTTG           <NA>           <NA>
    4         CACAGT         CCAAGG           <NA>           <NA>
    5           <NA>           <NA>         CACTGT         CACCAT
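
If you would rather avoid the stringr dependency, the same suffix comparison can probably be done in base R. This is only a sketch of that idea, not a replacement for the code above. The names cols, col_suffix, row_suffix, new_cols and res are introduced here for illustration, and the data is re-read from the question so the snippet runs on its own:

```r
# Example data copied from the question (whitespace-separated).
d <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
F_20TP53_Seq.x F_30TP53_Seq.x R_20TP53_Seq.x F_20TP53_Seq.y F_30TP53_Seq.y R_20TP53_Seq.y Name_of_F_TP53
CACTGT CAAAGT CATAGT AATGTTG CACAGT CAAAGT F_20TP53_Max_score.y
CACAGT CACTGT CACAGT CCAAGG CATAGT CACTGT F_30TP53_Max_score.y
CATAGT AATGTTG CACAG GCCAGG CACAGT CACTGT F_20TP53_Max_score.x
CACAGT CCAAGG CACCAT CAAAGT CACAG CACAGT F_30TP53_Max_score.x
CACTGT CACAGT CCAAGG CACTGT CACCAT CATAGT F_30TP53_Max_score.y
")

cols <- c(1, 2, 4, 5)  # sequence columns to compare against column 7

# Trailing ".x" / ".y" of each target column header and of each row's name
col_suffix <- sub(".*(\\.[xy])$", "\\1", colnames(d)[cols])
row_suffix <- sub(".*(\\.[xy])$", "\\1", d$Name_of_F_TP53)

# For each target column, keep the sequence where the suffixes agree
new_cols <- lapply(seq_along(cols), function(i) {
  ifelse(row_suffix == col_suffix[i], d[[cols[i]]], NA_character_)
})
names(new_cols) <- colnames(d)[cols]

# cbind() on a data frame keeps the duplicated column names as they are
res <- cbind(d, new_cols)
```

Because the new columns reuse existing names, it is safer to refer to them by position (for example res[[8]]) than by name, or to rename them before further processing.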