I have created a list of sequence names and sequences from a FASTA file. Does anybody know how I can remove the '>' character from the sequence names? I have tried using strip, replace, and map. Printing the names gives the following output:
>chrI
>chrII
>chrIII
where it should be:
chrI
chrII
chrIII
fp = open(r'demo_fasta_file_2022.fas', 'r')

def read_fasta(fp):
    sequence_names, sequences = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if sequence_names: yield (sequence_names, ''.join(sequences))
            sequence_names, sequences = line, []
        else:
            sequences.append(line)
    if sequence_names: yield (sequence_names, ''.join(sequences))

with open('demo_fasta_file_2022.fas') as fp:
    for sequence_names, sequences in read_fasta(fp):
        print(sequence_names)
CodePudding user response:
This is called string slicing, and there are several ways to do it. This might help: https://www.w3schools.com/python/gloss_python_string_slice.asp
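As a minimal sketch of what slicing looks like here (the list literal is only a stand-in for the header names from the question): strings are immutable, so the slice returns a new string that has to be stored or printed.

```python
# Stand-in for the header names read from the FASTA file.
names = [">chrI", ">chrII", ">chrIII"]

# name[1:] is a new string containing everything after the first character.
clean_names = [name[1:] for name in names]

print(clean_names)
```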
CodePudding user response:
Just slice:
print(line[1:])
If you are unsure of the presence of '>', use:
    if line.startswith(">"):
        print(line[1:])
    else:
        print(line)
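Applied to the generator from the question, the slice can go at the point where the header line is stored, since that branch has already checked for '>'. The following is a sketch, not a drop-in replacement: the shorter variable names and the in-memory demo data are assumptions made so the example runs without the original file.

```python
import io

def read_fasta(fp):
    """Yield (name, sequence) pairs, with the leading '>' removed from names."""
    name, chunks = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if name is not None:
                yield name, "".join(chunks)
            # Drop the '>' here, once, instead of at every print.
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the last record in the file.
    if name is not None:
        yield name, "".join(chunks)

# Hypothetical in-memory data standing in for demo_fasta_file_2022.fas.
demo = io.StringIO(">chrI\nACGT\nTTGA\n>chrII\nGGCC\n")
records = list(read_fasta(demo))
for name, seq in records:
    print(name, seq)
```

On Python 3.9 or later, line.removeprefix(">") should give the same result and leaves the line untouched when there is no '>' to remove.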
CodePudding user response:
You can also use a regex, which is a little safer than line[1:] because it only removes the '>' when the line actually starts with one:
import re
# ...
line = re.sub(r'^>', '', line, flags=re.MULTILINE)
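Here ^ anchors the match to the start of the string, and the call has the form re.sub(REGEX, REPLACE_WITH, INPUTSTRING). The re.MULTILINE flag makes ^ and $ match at the start and end of every line inside the string, so it only matters when the input contains several lines; for a single header line it can be omitted.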
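A small sketch of the difference the flag makes, using made-up input strings rather than the real file:

```python
import re

# Single line: the flag is not needed, ^ already matches at the start.
header = ">chrI"
single = re.sub(r"^>", "", header)

# Several lines in one string: without the flag only the first '>' is removed.
block = ">chrI\n>chrII\n>chrIII"
first_only = re.sub(r"^>", "", block)
every_line = re.sub(r"^>", "", block, flags=re.MULTILINE)

# A '>' that is not at the start of the line is left alone.
untouched = re.sub(r"^>", "", "chr>I")

print(single)
print(every_line)
```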