I have a data frame with 25 million rows and I need to run a substring function to all 25 million rows of data. Because of the size of the data frame I thought apply would be the most efficient way of doing this.
df <- data.frame( seq_start=c(75, 59, 44),
seq_end=c(151, 135, 120),
sequence=c("NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA", "NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG", "NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA"))
Function to accomplish this that I thought would be the most efficient:
apply(df,1,substr(sequence,seq_start,seq_end))
I'm not familiar with the apply function and a loop is way to inefficient to process 25 million lines.
CodePudding user response:
Not 100% sure what you need/want but it seems that using the dplyr
syntax is useful here (more useful than apply
as you're only looking to extract a substring from a single column)
library(dplyr)
df %>%
mutate(substring = substr(sequence,seq_start,seq_end))
seq_start seq_end
1 75 151
2 59 135
3 44 120
sequence
1 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG
3 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA
substring
1 ATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 TAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACAC
3 AAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATAT
Base R:
df$substring <- substr(df$sequence,df$seq_start,df$seq_end)