I have this text:
>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A
When I split it on the line breaks, I get a list that looks like this:
['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
I'm trying to merge all the strings between the items that start with '>', so that the result looks like:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
This is what I have so far, but it doesn't do anything and I'm lost:
my_list = ['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        temp = ''
        while my_list[item] != '>':
            temp = my_list[item]
        result.append(temp)
print(result)
CodePudding user response:
You can use itertools.groupby for the task:
from itertools import groupby
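lst = [
    ">A1",
    "KKKKKKKK",
    "DDDDDDDD",
    ">A2",
    "FFFFFFFF",
    "FFFFOOOO",
    "DAA",
    ">A3",
    "OOOZDDD",
    "KKAZAAA",
    "A",
]

out = []
# groupby yields runs of consecutive items with the same key:
# k is True for header lines ('>...'), False for the lines in between
for k, g in groupby(lst, lambda s: s.startswith(">")):
    if not k:
        out.append("".join(g))

print(out)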
Prints:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
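To see why this works, it may help to look at what groupby actually yields. It groups consecutive items that share the same key, so header runs and sequence runs alternate. A small sketch (the names pairs and is_header are only for illustration, and the list is the one from the question):

```python
from itertools import groupby

lst = [">A1", "KKKKKKKK", "DDDDDDDD", ">A2", "FFFFFFFF", "FFFFOOOO",
       "DAA", ">A3", "OOOZDDD", "KKAZAAA", "A"]

# Turn each group into a list so it can be inspected;
# the key is True for header lines and False for everything else.
pairs = [
    (is_header, list(group))
    for is_header, group in groupby(lst, lambda s: s.startswith(">"))
]

for is_header, group in pairs:
    print(is_header, group)
```

Each False group holds exactly the strings that belong to one record, which is why joining those groups gives the merged result.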
CodePudding user response:
@Andrej has given a compact solution to your problem, but I want to help by pointing out some issues in your original code.
- You have a while inside the if, but when my_list[item] starts with '>', the inner while won't work (item never changes inside it, so the loop never ends). The correct thing is to add an else branch that concatenates the following strings.
- You append a string temp to result at each iteration, but temp is not a concatenated string. The correct time to append is when you meet '>' again.
After fixing both, you may get something like this:
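result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        if item != 0:
            result.append(temp)  # a new header closes the previous block
        temp = ''
    else:
        temp += my_list[item]  # keep concatenating the current block
if item != 0:
    result.append(temp)  # the last block has no header after it
print(result)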
You can simplify it further:
- Avoid list indexing by iterating over the list directly.
- Avoid the repeated final check by adding a sentinel.
result = []
concat_string = ''  # renamed from temp for readability
for string in my_list + ['>']:  # iterate over the list directly and add a sentinel
    if string[0] == '>':  # or string.startswith('>')
        if concat_string:
            result.append(concat_string)
        concat_string = ''
    else:
        concat_string += string
print(result)
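If you need this in more than one place, the same loop could be wrapped in a function and fed straight from the raw text. This is only a sketch: the name merge_blocks is made up for illustration, and it assumes every header line starts with '>'.

```python
def merge_blocks(lines):
    """Join the strings found between header lines (those starting with '>')."""
    result = []
    concat_string = ''
    for string in lines + ['>']:  # the sentinel flushes the last block
        if string.startswith('>'):
            if concat_string:
                result.append(concat_string)
            concat_string = ''
        else:
            concat_string += string
    return result


text = """>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A"""

merged = merge_blocks(text.splitlines())
print(merged)
```

Using startswith instead of string[0] also keeps the loop from failing on an empty line.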
CodePudding user response:
Regex version:
data = """>A1
KKKKKKKK
DDDDDDDD
>A2
FFFFFFFF
FFFFOOOO
DAA
>A3
OOOZDDD
KKAZAAA
A"""
import re
patre = re.compile("^>.+\n", re.MULTILINE)
#split on `>xxx`
chunks = patre.split(data)
#remove whitespaces and newlines
blocks = [v.replace("\n","").strip() for v in chunks]
#get rid of leading trailing empty blocks
blocks = [v for v in blocks if v]
print(blocks)
output:
['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
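The cleanup steps are easier to follow once you see what the split returns before any cleaning. A shortened sketch with made-up sample data (two records only), using re.split directly instead of a compiled pattern:

```python
import re

# Shortened sample in the same format as the question's text.
data = ">A1\nKKKKKKKK\nDDDDDDDD\n>A2\nDAA"

# Split on whole header lines; MULTILINE lets ^ match at each line start.
chunks = re.split(r"^>.+\n", data, flags=re.MULTILINE)
print(chunks)
```

The first chunk is an empty string because the text starts with a header, and the other chunks still contain newlines. That is why the version above strips the newlines and then filters out the empty blocks.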