All, I have a large genomic data that I want to do find-and-replace based on "REF" and "ALT" column values for each row. The replacement rules are:
- If the value = 0, replace it with "-/-"
- If the value = 1, replace it with two REF alleles
- If the value = 2, replace it with two ALT alleles
- If the value = 3, replace it with one REF allele and one ALT allele
Here is a small subset of an example data:
Chr Pos REF ALT A B C D E
1 70 A G 1 0 1 1 1
1 80 T G 1 0 3 3 3
1 100 C T 1 0 1 1 1
2 20 G A 1 0 0 0 1
2 80 C T 1 0 0 0 2
The desired output is:
Chr Pos REF ALT A B C D E
1 70 A G A/A -/- A/A A/A A/A
1 80 T G T/T -/- T/G T/G T/G
1 100 C T C/C -/- C/C C/C C/C
2 20 G A G/G -/- -/- -/- G/G
2 80 C T C/C -/- -/- -/- T/T
Reproducible data frame:
df=data.frame(
Chr=c(1,1,1,2,2),
Pos=c(70,80,100,20,80),
REF=c("A","T","C","G","C"),
ALT=c("G","G","T","A","T"),
A=c(1,1,1,1,1),
B=c(0,0,0,0,0),
C=c(1,3,1,0,0),
D=c(1,3,1,0,0),
E=c(1,3,1,1,2)
)
I wrote a for loop for the task:
K=data.frame()
for (r in 1:nrow(df)){
k=df[r,]
ks=df[r,-1:-4]
ks[ks==0]="-/-"
ks[ks==1]=paste0(k$REF,"/",k$REF)
ks[ks==2]=paste0(k$ALT,"/",k$ALT)
ks[ks==3]=paste0(k$REF,"/",k$ALT)
ks=cbind(k[,1:4],ks)
K=rbind(K,ks)
}
That worked okay, however, I have about 150,000 lines of them and the row by row operation takes very long time so I was wondering if there is a faster way to process this?
Thank you very much for the help!
CodePudding user response:
We may use case_when
- loop across
the columns 'A' to 'E', create the format with sprintf
based on the condition created in `case_when
library(dplyr)
df %>%
mutate(across(A:E, ~ case_when(. == 0 ~ '-/-',
. == 1 ~ sprintf('%1$s/%1$s', REF),
.==2 ~ sprintf('%1$s/%1$s', ALT),
TRUE ~ sprintf('%s/%s', REF, ALT))))
-output
Chr Pos REF ALT A B C D E
1 1 70 A G A/A -/- A/A A/A A/A
2 1 80 T G T/T -/- T/G T/G T/G
3 1 100 C T C/C -/- C/C C/C C/C
4 2 20 G A G/G -/- -/- -/- G/G
5 2 80 C T C/C -/- -/- -/- T/T
NOTE: As this is done by column and not by row, it should be much faster than the OP's approach