Home > Mobile >  Find and replace conditional column value by row for a large data frame in R
Find and replace conditional column value by row for a large data frame in R

Time:10-07

All, I have a large genomic data that I want to do find-and-replace based on "REF" and "ALT" column values for each row. The replacement rules are:

  • If the value = 0, replace it with "-/-"
  • If the value = 1, replace it with two REF alleles
  • If the value = 2, replace it with two ALT alleles
  • If the value = 3, replace it with one REF allele and one ALT allele

Here is a small subset of an example data:

Chr Pos REF ALT  A  B  C  D  E
  1  70   A   G  1  0  1  1  1
  1  80   T   G  1  0  3  3  3
  1 100   C   T  1  0  1  1  1
  2  20   G   A  1  0  0  0  1
  2  80   C   T  1  0  0  0  2

The desired output is:

Chr Pos REF ALT    A    B    C    D    E
  1  70   A   G  A/A  -/-  A/A  A/A  A/A
  1  80   T   G  T/T  -/-  T/G  T/G  T/G
  1 100   C   T  C/C  -/-  C/C  C/C  C/C
  2  20   G   A  G/G  -/-  -/-  -/-  G/G
  2  80   C   T  C/C  -/-  -/-  -/-  T/T

Reproducible data frame:

df=data.frame(
  Chr=c(1,1,1,2,2),
  Pos=c(70,80,100,20,80),
  REF=c("A","T","C","G","C"),
  ALT=c("G","G","T","A","T"),
  A=c(1,1,1,1,1),
  B=c(0,0,0,0,0),
  C=c(1,3,1,0,0),
  D=c(1,3,1,0,0),
  E=c(1,3,1,1,2)
)

I wrote a for loop for the task:

K=data.frame()
for (r in 1:nrow(df)){
  k=df[r,]
  ks=df[r,-1:-4]
  ks[ks==0]="-/-"
  ks[ks==1]=paste0(k$REF,"/",k$REF)
  ks[ks==2]=paste0(k$ALT,"/",k$ALT)
  ks[ks==3]=paste0(k$REF,"/",k$ALT)
  ks=cbind(k[,1:4],ks)
  K=rbind(K,ks)
}

That worked okay, however, I have about 150,000 lines of them and the row by row operation takes very long time so I was wondering if there is a faster way to process this?

Thank you very much for the help!

CodePudding user response:

We may use case_when - loop across the columns 'A' to 'E', create the format with sprintf based on the condition created in `case_when

library(dplyr)
df %>% 
   mutate(across(A:E, ~ case_when(. == 0 ~ '-/-', 
                  . == 1 ~ sprintf('%1$s/%1$s', REF),
                  .==2 ~ sprintf('%1$s/%1$s', ALT), 
                   TRUE ~ sprintf('%s/%s', REF, ALT))))

-output

 Chr Pos REF ALT   A   B   C   D   E
1   1  70   A   G A/A -/- A/A A/A A/A
2   1  80   T   G T/T -/- T/G T/G T/G
3   1 100   C   T C/C -/- C/C C/C C/C
4   2  20   G   A G/G -/- -/- -/- G/G
5   2  80   C   T C/C -/- -/- -/- T/T

NOTE: As this is done by column and not by row, it should be much faster than the OP's approach

  •  Tags:  
  • r
  • Related