How to group list items based on a specific condition?

Time:11-19

I have this text:

>A1
KKKKKKKK
DDDDDDDD

>A2
FFFFFFFF
FFFFOOOO
DAA

>A3
OOOZDDD
KKAZAAA
A

When I split it on line breaks and remove the empty lines, I get this list:

['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']

I'm trying to merge all the strings between each part that starts with '>', such that it looks like:

['KKKKKKKKDDDDDDDD',  'FFFFFFFFFFFFOOOODAA',  'OOOZDDDKKAZAAAA']

This is what I have so far, but it doesn't do anything and I'm lost:

my_list = ['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA', '>A3', 'OOOZDDD', 'KKAZAAA', 'A']

result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        temp = ''
        while my_list[item] != '>':
            temp += my_list[item]
    result.append(temp)

print(result)

CodePudding user response:

You can use itertools.groupby for the task:

from itertools import groupby

lst = [
    ">A1",
    "KKKKKKKK",
    "DDDDDDDD",
    ">A2",
    "FFFFFFFF",
    "FFFFOOOO",
    "DAA",
    ">A3",
    "OOOZDDD",
    "KKAZAAA",
    "A",
]

out = []
for k, g in groupby(lst, lambda s: s.startswith(">")):
    if not k:
        out.append("".join(g))

print(out)

Prints:

["KKKKKKKKDDDDDDDD", "FFFFFFFFFFFFOOOODAA", "OOOZDDDKKAZAAAA"]
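To see why the `if not k` filter works, it may help to look at what groupby yields before filtering. The following is only an illustrative sketch: the shortened `lst` and the `groups` name are assumptions made for this example, not part of the answer above.

```python
from itertools import groupby

# Shortened sample input (assumed for illustration).
lst = [">A1", "KKKKKKKK", "DDDDDDDD", ">A2", "FFFFFFFF", "DAA"]

# groupby collects runs of consecutive items that share the same key.
# The key is True for header lines and False for sequence lines.
groups = [(k, list(g)) for k, g in groupby(lst, lambda s: s.startswith(">"))]

for key, items in groups:
    print(key, items)
```

Each run with a False key holds the lines of one record, which is why joining only those runs gives the merged strings.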

CodePudding user response:

@Andrej has given compact code for your problem, but I want to help by pointing out some issues in your original code.

  1. You put the while loop inside the if, but when my_list[item] starts with '>', the condition my_list[item] != '>' is always true and item never changes inside the loop, so the inner while never finishes. The fix is to add an else branch that concatenates the following strings.
  2. You append temp to result on every iteration, but at that point temp is not yet the full concatenated string. The right time to append is when you reach the next '>'.

After fixing these, you may get something like this:

result = []
for item in range(len(my_list)):
    if my_list[item][0] == '>':
        if item != 0:
            result.append(temp)
        temp = ''
    else:
        temp += my_list[item]
if item != 0:
    result.append(temp)  # append the last block, which no later '>' flushes
print(result)

You can simplify it further:

  1. Avoid list indexing by iterating over the list directly.
  2. Avoid the repeated final check by appending a sentinel to the list.

result = []
concat_string = ''  # renamed for readability
for string in my_list + ['>']:  # iterate over the list directly and add a sentinel
    if string[0] == '>':  # or string.startswith('>')
        if concat_string:
            result.append(concat_string)
        concat_string = ''
    else:
        concat_string += string
print(result)
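The same loop can be wrapped in a function so it can be reused on other lists. This is a minimal sketch rather than a definitive version; the name `merge_blocks` and its `marker` parameter are hypothetical and do not appear in the answer above.

```python
def merge_blocks(items, marker=">"):
    """Join the strings found between lines that start with `marker`."""
    result = []
    current = ""
    # The extra marker at the end is a sentinel that flushes the last block.
    for string in list(items) + [marker]:
        if string.startswith(marker):
            if current:
                result.append(current)
            current = ""
        else:
            current += string
    return result


my_list = ['>A1', 'KKKKKKKK', 'DDDDDDDD', '>A2', 'FFFFFFFF', 'FFFFOOOO', 'DAA',
           '>A3', 'OOOZDDD', 'KKAZAAA', 'A']
print(merge_blocks(my_list))
```

Using startswith here instead of string[0] also means an empty string in the list does not raise an IndexError.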

CodePudding user response:

Regex version:

data = """>A1
KKKKKKKK
DDDDDDDD

>A2
FFFFFFFF
FFFFOOOO
DAA

>A3
OOOZDDD
KKAZAAA
A"""

import re

patre = re.compile(r"^>.+\n", re.MULTILINE)
# split on the `>xxx` header lines
chunks = patre.split(data)
# remove whitespace and newlines
blocks = [v.replace("\n", "").strip() for v in chunks]
# drop the empty blocks left at the start or end
blocks = [v for v in blocks if v]

print(blocks)

output:

['KKKKKKKKDDDDDDDD', 'FFFFFFFFFFFFOOOODAA', 'OOOZDDDKKAZAAAA']
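
If the input is still a single string, plain str.split may also be enough, without a regular expression. This is a swapped-in alternative, not the method used above. It assumes that '>' only ever appears at the start of a header line and that each header is a single token with no spaces; the `merged` name is made up for this sketch.

```python
data = """>A1
KKKKKKKK
DDDDDDDD

>A2
FFFFFFFF
FFFFOOOO
DAA

>A3
OOOZDDD
KKAZAAA
A"""

merged = []
for block in data.split(">"):
    tokens = block.split()  # splits on any whitespace, dropping blank lines
    if not tokens:
        continue            # skip the empty piece before the first '>'
    # tokens[0] is the header (e.g. 'A1'); the rest are the sequence lines
    merged.append("".join(tokens[1:]))

print(merged)
```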