I'm a bioinformatician dealing with a very large text file, 64.2 GB in size. I ran word count (wc) on the file and, after quite some time, got this result:
1052454251 1052456168 64199706147 GRCh38.fa
My problem is that I want to delete only a few lines from this file. I searched Google for Python ways to delete specific lines from a text file, but almost all of the suggestions involve rewriting the complete file while skipping the part to be deleted.
I had followed the same approach even before consulting Google, but initially I ran into memory issues: I used the readlines() function to read the entire file in one go, and my system hung. I then switched to reading the file one line at a time and filtering out the lines I don't need. However, this approach is very time consuming, since my input file is 64 GB.
Can anyone suggest a specific Python or Unix command to delete only certain lines from a file?
For more context, my input file contains the complete human genome sequence in FASTA format. It looks like this:
>chr1 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr2 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr3 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
.
.
.
>chr22 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrM and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrX and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrY and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
Here, a line starting with ">" is called a header; the remaining lines contain the actual DNA sequence.
Immediately after ">" comes the chromosome name (chr1, chr2, chr3, ..., chrM, chrX, or chrY). I just want to delete the chrM header line and the DNA sequence lines below it, so my output file should look like this:
>chr1 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr2 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr3 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
.
.
.
>chr22 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrX and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrY and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
Here is the code that I wrote:
from memory_profiler import profile

@profile  # Just to check how much memory is used.
def memcheck():
    to_write = True
    # Stream the input line by line so memory use stays constant.
    with open("GRCh38.fa") as f, open("chrMremoved.fa", "w") as g:
        for line in f:
            if line.startswith(">"):
                # A new FASTA record starts here; drop it only if it is chrM.
                to_write = not line.startswith(">chrM")
            if to_write:
                g.write(line)

if __name__ == "__main__":
    memcheck()
Also, since most of my work revolves around analysing huge datasets, it would be helpful if someone could give me some tips on dealing with large datasets in Python (i.e., writing memory-efficient and time-efficient code).
Sometimes my Python code gets killed; after searching Google, I found out that this happens because the process runs out of RAM. Please advise what I can do in such situations.
CodePudding user response:
There is no cheap way to delete lines from the middle of a file in place; the remainder of the file always has to be rewritten. What you can do is hand the rewriting to a specialized tool such as seqkit, which filters FASTA records by sequence name:
seqkit grep -v -p chrM GRCh38.fa -o chrMremoved.fa
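Here, -v inverts the match and -p gives the sequence name to match, so every record except chrM is written to chrMremoved.fa.
If you prefer a plain Unix tool, an awk one-liner performs the same single-pass filter. This is a minimal sketch, assuming the unwanted header starts exactly with ">chrM" followed by a space or end of line, as in your sample:
awk '/^>/ { skip = /^>chrM( |$)/ } !skip' GRCh38.fa > chrMremoved.fa
Either way, the whole 64 GB is streamed through once; that single read-and-write pass is the unavoidable cost, but awk's per-line overhead is much lower than a Python loop's. As a general tip for large files, iterate lazily (for line in f) or in fixed-size chunks rather than calling readlines(), so memory use stays constant no matter how big the file is.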