Remove specific line breaks in Python

I have a long FASTA file and I need to reformat its lines. I have tried many things, but since I'm not very familiar with Python, I couldn't solve it.

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I want them to look like:

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I've tried this:

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)

But the result doesn't show the ">" lines, and it also merges all the other lines together. I hope you can help me with this. Thank you.

CodePudding user response：

A common arrangement is to remove the newline, and then add it back when you see the next record.

# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()

A common beginner bug is forgetting to add the final newline after falling off the end of the loop.

This simply prints one line at a time and doesn't return anything. A better design would probably be to collect and yield one FASTA record (header + sequence) at a time, perhaps as an object, and have the caller decide what to do with it. But then you probably want to use an existing library that already does this; Biopython seems to be the go-to solution for bioinformatics.
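
As a rough sketch of the record-at-a-time design described above, a generator can yield one (header, sequence) pair per record. The name read_fasta and the in-memory sample are made up for illustration, and this is not how Biopython does it:

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None and line:
            # Sequence line belonging to the current record.
            chunks.append(line)
    # Emit the final record, which no later header closes.
    if header is not None:
        yield header, "".join(chunks)

# Small in-memory demo; a real caller would pass open("file.fasta") instead.
sample = io.StringIO(">seq1\nAAAA\nCCCC\n>seq2\nGGGG\nTT\n")
records = list(read_fasta(sample))
for header, sequence in records:
    print(header)
    print(sequence)
```

The caller then decides what to do with each record: print it, write it to another file, or filter it.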

CodePudding user response：

Since you’re working with FASTA data, another solution would be to use a dedicated library, in which case what you want is a one-liner:

import sys

from Bio import SeqIO

SeqIO.write(SeqIO.parse('file.fasta', 'fasta'), sys.stdout, 'fasta-2line')

Using the 'fasta-2line' format description tells SeqIO.write to omit line breaks inside sequences.

CodePudding user response：

First the usual disclaimer: operate on files using a with block when at all possible. Otherwise they won't be closed on error.

Observe that you want to remove the newline from every line not starting with >, except the last one of every block. You can achieve the same effect by stripping the newline from every line that doesn't start with >, and prepending a newline to each line starting with > except the first.

import sys

out = sys.stdout
with open(..., 'r') as file:
    first = True
    hasline = False
    for line in file:
        if line.startswith('>'):
            if not first:
                out.write('\n')
            out.write(line)
            first = False
        else:
            out.write(line.rstrip())
            hasline = True
    if hasline:
        out.write('\n')

Printing as you go is much simpler than accumulating the strings in this case. Printing to a file using the write method is simpler than using print when you're just transcribing lines.
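
To illustrate the point about write, here is a minimal sketch of the same idea wrapped in a function that accepts any readable and writable text stream, so the output can go to sys.stdout, a file, or a buffer. The function name unwrap_fasta and the sample data are assumptions for the example:

```python
import io

def unwrap_fasta(src, out):
    """Copy FASTA text from src to out, joining wrapped sequence lines."""
    need_newline = False  # True while an unterminated sequence is pending
    for line in src:
        if line.startswith(">"):
            # Terminate the previous sequence before writing the header.
            if need_newline:
                out.write("\n")
                need_newline = False
            out.write(line if line.endswith("\n") else line + "\n")
        else:
            text = line.rstrip()
            if text:
                out.write(text)
                need_newline = True
    # Terminate the last sequence in the file.
    if need_newline:
        out.write("\n")

# In-memory demo; with real files, pass two open file objects instead.
src = io.StringIO(">seq1\nAAAA\nCCCC\n>seq2\nGGGG\nTT\n")
out = io.StringIO()
unwrap_fasta(src, out)
print(out.getvalue(), end="")
```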

CodePudding user response：

I have fixed some mistakes in your code.

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
needed_lines = []
for line in a_file:
    if line.strip().startswith(">") or line.strip() == "":
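        # If any sequence lines were accumulated before, commit them.
        if string_without_line_breaks != "":
            needed_lines.append(string_without_line_breaks)
            string_without_line_breaks = ""
        # Strip the newline here, because join() adds one between items.
        needed_lines.append(line.strip())
        continue
    else:
        stripped_line = line.strip()
        string_without_line_breaks += stripped_line
a_file.close()
# Commit the last sequence, which is not followed by another header.
if string_without_line_breaks != "":
    needed_lines.append(string_without_line_breaks)
print("\n".join(needed_lines))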

CodePudding user response：

Please make sure to add the lines starting with the greater-than sign (>) to your string.

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        string_without_line_breaks += "\n" + line
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
# Drop the extra newline that was added before the first header.
print(string_without_line_breaks.lstrip("\n"))

By the way, you can also do this with a single regular expression:

import re

with open("file.fasta", 'r') as f:
    data = f.read()

result = re.sub(r"^(?!>)(.*)$\n(?!>)", r"\1", data, flags=re.MULTILINE)

print(result)

The regex contains a negative lookahead so that lines starting with > are not trimmed, and a second one so that the newline right before a > line is kept.
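
To see what the pattern does, here is a small sketch that applies it to a made-up in-memory string instead of a file (the sample data is an assumption for the example). Note that it also removes the newline at the very end of the input:

```python
import re

# Hypothetical sample standing in for the contents of file.fasta.
sample = ">seq1\nAAAA\nCCCC\nGG\n>seq2\nTTTT\nAA\n"

# Same pattern as above: match a non-header line and the newline after it,
# but only when the following line is not a header.
pattern = re.compile(r"^(?!>)(.*)$\n(?!>)", flags=re.MULTILINE)
result = pattern.sub(r"\1", sample)

print(result)
```

Because the whole file is read into memory first, this is convenient for small files but may be a poor fit for very large ones.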

CodePudding user response：

rstrip() with no argument strips all trailing whitespace, including the newline. If you want to strip only the newline, tell the function what to strip off: line.rstrip('\n') will do.
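
For reference, a quick sketch of how rstrip() behaves with and without an argument (the sample line is made up):

```python
line = "ACGT  \n"

no_arg = line.rstrip()            # strips all trailing whitespace
newline_only = line.rstrip("\n")  # strips only trailing newlines

print(repr(no_arg))
print(repr(newline_only))
```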

CodePudding user response：

If I understood correctly, you need to convert each sequence to a single line with no line breaks, as follows:

nobreak_seq1 = ''.join(seq1.splitlines())

nobreak_seq2 = ''.join(seq2.splitlines())

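(Note: the string before .join is two single quotes, i.e. an empty string, not a single double-quote character. Here seq1 and seq2 are assumed to be strings that each hold one multi-line sequence.)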
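
A minimal sketch of this approach, assuming seq1 is a string that holds one sequence spread over several lines (the value is made up):

```python
# Hypothetical multi-line sequence, without its header line.
seq1 = "AAAA\nCCCC\nGG\n"

# splitlines() drops the line endings; join() glues the pieces back together.
nobreak_seq1 = "".join(seq1.splitlines())

print(nobreak_seq1)
```

This handles a single sequence only; the header lines have to be separated out beforehand.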