I have a long FASTA file and I need to reformat its lines. I tried many things, but since I'm not very familiar with Python I couldn't solve it completely.
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I want them to look like:
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I've tried this:
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)
But the result does not show the ">" lines, and it also merges all the other lines together. I hope you can help me with this. Thank you.
CodePudding user response:
A common arrangement is to remove the newline, and then add it back when you see the next record.
# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()
A common beginner bug is forgetting to add the final newline after falling off the end of the loop.
This simply prints one line at a time and doesn't return anything. A better design would probably be to collect and yield one FASTA record (header + sequence) at a time, probably as an object, and have the caller decide what to do with it; but then, you probably want to use an existing library which does that. BioPython seems to be the go-to solution for bioinformatics.
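For what it's worth, the "collect and yield one record at a time" design could look roughly like the sketch below. The name read_fasta and the sample lines are made up for illustration, records are returned as plain (header, sequence) tuples rather than objects, and anything before the first header is silently dropped.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, with each sequence joined onto one line."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line:
            # Sequence line: collect it (blank lines are skipped).
            chunks.append(line)
    # Emit the final record, which no later header closes.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical sample input; an open file object would work the same way.
sample = [">seq1\n", "AAAA\n", "CCCC\n", ">seq2\n", "GGGG\n", "TT\n"]
for header, sequence in read_fasta(sample):
    print(header)
    print(sequence)
```

The caller decides what to do with each record, so printing, filtering, or writing to another file all use the same loop.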
CodePudding user response:
Since you’re working with FASTA data, another solution would be to use a dedicated library, in which case what you want is a one-liner:
import sys
from Bio import SeqIO
SeqIO.write(SeqIO.parse('file.fasta', 'fasta'), sys.stdout, 'fasta-2line')
Using the 'fasta-2line' format description tells SeqIO.write to omit line breaks inside sequences.
CodePudding user response:
First the usual disclaimer: operate on files using a with block when at all possible. Otherwise they won't be closed on error.
Observe that you want to remove newlines on every line not starting with >, except the last one of every block. You can achieve the same effect by stripping the newline after every line that doesn't start with >, and prepending a newline to each line starting with > except the first.
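import sys

out = sys.stdout
with open(..., 'r') as file:
    first = True      # True until the first header has been written
    hasline = False   # True once any sequence line has been written
    for line in file:
        if line.startswith('>'):
            # End the previous block before every header except the first
            if not first:
                out.write('\n')
            out.write(line)
            first = False
        else:
            out.write(line.rstrip())
            hasline = True
    # Terminate the last block, which no header follows
    if hasline:
        out.write('\n')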
Printing as you go is much simpler than accumulating the strings in this case. Printing to a file using the write method is simpler than using print when you're just transcribing lines.
CodePudding user response:
I have fixed some mistakes in your code.
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
needed_lines = []
for line in a_file:
    if line.strip().startswith(">") or line.strip() == "":
        # If any sequence lines were accumulated before, commit them.
        if string_without_line_breaks != "":
            needed_lines.append(string_without_line_breaks)
            string_without_line_breaks = ""
        # Strip the trailing newline; join() adds the separators back.
        needed_lines.append(line.strip())
        continue
    else:
        stripped_line = line.strip()
        string_without_line_breaks += stripped_line
a_file.close()
# Commit the last record's sequence, which no header follows.
if string_without_line_breaks != "":
    needed_lines.append(string_without_line_breaks)
print("\n".join(needed_lines))
CodePudding user response:
Please make sure to add the lines containing the right angle bracket (>) to your string.
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        # Keep the header, starting it on a new line.
        string_without_line_breaks += "\n" + line
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
# Drop the newline that was added before the first header.
print(string_without_line_breaks.lstrip("\n"))
By the way, you can also do this with a single regex substitution:
import re
with open("file.fasta", 'r') as f:
    data = f.read()
result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)
print(result)
The regex uses negative lookaheads to avoid trimming lines that start with >, and to avoid trimming lines that come right before a line starting with >.
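To see what the substitution does without touching a file, here is a small sketch that runs the same pattern on an in-memory string. The sample data is made up and only stands in for the contents of file.fasta.

```python
import re

# Made-up sample data standing in for the contents of file.fasta.
data = ">seq1\nAAAA\nCCCC\nGG\n>seq2\nTTTT\nAA\n"

# Same pattern as above: a newline is dropped only when neither the
# current line nor the following line starts with ">".
result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)
print(result)
```

Note that the newline at the very end of the data is removed as well, since nothing follows it; print() puts one back.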
CodePudding user response:
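You can tell the rstrip() function what you want to strip off. With no argument it removes all trailing whitespace, not just the newline. If you only want to remove the newline, simply line.rstrip('\n') will do.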
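A quick way to check the behaviour yourself is the sketch below. The sample line is made up, with trailing spaces added on purpose so the two calls give different results.

```python
# Made-up line with trailing spaces and a newline.
line = "ACGT  \n"

# No argument: every trailing whitespace character is removed.
print(repr(line.rstrip()))

# Explicit argument: only trailing newline characters are removed.
print(repr(line.rstrip("\n")))
```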
CodePudding user response:
If I understood correctly, you need to convert each sequence to a single line with no line breaks, as follows (assuming seq1 and seq2 hold the text of each sequence):
nobreak_seq1 = ''.join(seq1.splitlines())
nobreak_seq2 = ''.join(seq2.splitlines())
(Note: the string before .join is two single quotes, i.e. an empty string, not one double quote.)
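As a minimal sketch of that idea on a concrete value: the string below is made up, and seq1 is assumed to already contain just the sequence lines of one record, without its header.

```python
# Made-up sequence text for one record (header not included).
seq1 = "AAAA\nCCCC\nGG\n"

# splitlines() drops the line breaks; joining on the empty string glues
# the pieces back together into one line.
nobreak_seq1 = "".join(seq1.splitlines())
print(nobreak_seq1)
```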