sum elements in python list if match condition

Time:11-27

I have a variable that holds lists with a varying number of elements:

['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']
['124', 'M', '19', 'M', '7', 'M']
['19', 'M', '131', 'M']
['3', 'M', '19', 'M', '128', 'M']
['12', 'M', '138', 'M']

Each list always alternates number, letter, and the order matters.

I would like to add up the values of consecutive Ms only (i.e. if there is a D, skip the sum), to get:

['30', 'M', '1', 'D', '120', 'M']
['150', 'M']
['150', 'M']
['150', 'M']
['150', 'M']

P.S. The complete story is that I want to convert soft clips to matches in a BAM file, but I got stuck at this step.

#!/usr/bin/python

import re
import sys
import pysam

bamFile = sys.argv[1]

bam = pysam.AlignmentFile(bamFile, 'rb')

for read in bam:
    cigar = read.cigarstring
    # split the CIGAR string into alternating numbers and operation letters
    sepa = re.findall(r'(\d+|[A-Za-z]+)', cigar)

    # relabel soft clips (S) as matches (M)
    for i in range(len(sepa)):
        if sepa[i] == 'S':
            sepa[i] = 'M'

CodePudding user response:

You can slice Python lists using a step (sometimes called a stride). Use this to get every second element, starting at index 1 (the first letter):

>>> example = ['30', 'M', '1', 'D', '120', 'M']
>>> example[1::2]
['M', 'D', 'M']

The [1::2] syntax means: start at index 1, go on until you run out of elements (nothing entered between the : delimiters), and step over the list to return every second value.

You can do the same thing for the numbers, using [::2], so begin with the value right at the start and take every other value.
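As a quick illustration of the two slices side by side (a minimal sketch; the variable names `numbers` and `letters` are only for this example):

```python
example = ['30', 'M', '1', 'D', '120', 'M']

numbers = example[::2]   # every second element, starting at index 0
letters = example[1::2]  # every second element, starting at index 1

print(numbers)  # ['30', '1', '120']
print(letters)  # ['M', 'D', 'M']
```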

If you then combine this with the zip() function you can pair up your numbers and letters to figure out what to sum:

def sum_m_values(values):
    summed = []
    m_sum = 0
    for number, letter in zip(values[::2], values[1::2]):
        if letter != "M":
            if m_sum:
                summed += (str(m_sum), "M")
                m_sum = 0
            summed += (number, letter)
        else:
            m_sum += int(number)
    if m_sum:
        summed += (str(m_sum), "M")
    return summed

The above function takes your list of numbers and letters and:

  • creates a list for the results
  • tracks a running sum of "M" values
  • pairs up the numbers and letters
  • for each pair:
    • if the letter is "M", adds the number (as an integer) to the running sum.
    • otherwise, adds the running sum (if any) to the list with the letter "M", then adds the current number and letter too.
  • after all pairs are processed, adds the running sum and the letter "M", if there is any.

This covers all your example inputs:

>>> def sum_m_values(values):
...     summed = []
...     m_sum = 0
...     for number, letter in zip(values[::2], values[1::2]):
...         if letter != "M":
...             if m_sum:
...                 summed += (str(m_sum), "M")
...                 m_sum = 0
...             summed += (number, letter)
...         else:
...             m_sum += int(number)
...     if m_sum:
...         summed += (str(m_sum), "M")
...     return summed
...
>>> examples = [
...     ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M'],
...     ['124', 'M', '19', 'M', '7', 'M'],
...     ['19', 'M', '131', 'M'],
...     ['3', 'M', '19', 'M', '128', 'M'],
...     ['12', 'M', '138', 'M'],
... ]
>>> for example in examples:
...     print(example, "->", sum_m_values(example))
...
['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M'] -> ['30', 'M', '1', 'D', '120', 'M']
['124', 'M', '19', 'M', '7', 'M'] -> ['150', 'M']
['19', 'M', '131', 'M'] -> ['150', 'M']
['3', 'M', '19', 'M', '128', 'M'] -> ['150', 'M']
['12', 'M', '138', 'M'] -> ['150', 'M']

There are other methods of looping over a list in fixed-sized groups; you can also create an iterator for the list with iter():

it = iter(inputlist)
for number, letter in zip(it, it):
    # ...

This works because zip() gets the next element for each value in the pair from the same iterator, so "124" first, then "M", etc.:

>>> example = ['124', 'M', '19', 'M', '7', 'M']
>>> it = iter(example)
>>> for number, letter in zip(it, it):
...     print(number, letter)
...
124 M
19 M
7 M

However, for short lists it is perfectly fine to use slicing, as it can be understood more easily.

Next, you can make the summing a little easier by using itertools.groupby() to give you your (number, letter) pairs as separate groups. That function takes an input sequence and a function that produces the group identifier. When you then loop over its output, you are given that group identifier and an iterator to access the group members.

Just pass it the zip() iterator built before, and either lambda pair: pair[1] or operator.itemgetter(1); the latter is a little faster but does the same thing as the lambda: it gets the letter from the (number, letter) pair.
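To see what groupby() hands back for one of the inputs, here is a small sketch that materialises each group into a list purely for display (the `groups` name is only for this illustration):

```python
from itertools import groupby
from operator import itemgetter

example = ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']
it = iter(example)
paired = zip(it, it)  # ('20', 'M'), ('10', 'M'), ('1', 'D'), ...

# list(group) consumes each group's iterator so it can be printed
groups = [(letter, list(group)) for letter, group in groupby(paired, itemgetter(1))]
for letter, members in groups:
    print(letter, members)
# M [('20', 'M'), ('10', 'M')]
# D [('1', 'D')]
# M [('14', 'M'), ('106', 'M')]
```

Only adjacent pairs with the same letter end up in the same group, which is why the two runs of M stay separate.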

With separate groups, the logic starts to look a lot simpler:

from itertools import groupby
from operator import itemgetter

def sum_m_values(values):
    summed = []
    it = iter(values)
    paired = zip(it, it)

    for letter, grouped in groupby(paired, itemgetter(1)):
        if letter == "M":
            total = sum(int(number) for number, _ in grouped)
            # convert back to a string so the output matches the first version
            summed += (str(total), letter)
        else:
            # add the (number, "D") as separate elements
            for number, letter in grouped:
                summed += (number, letter)

    return summed

The output of the function hasn't changed, only the implementation.

Finally, we could turn the function into a generator function by replacing the summed += ... statements with yield from ..., so it'll still generate a sequence of numeric strings and letters:

from itertools import groupby
from operator import itemgetter

def sum_m_values(values):
    it = iter(values)
    paired = zip(it, it)

    for letter, grouped in groupby(paired, itemgetter(1)):
        if letter == "M":
            total = sum(int(number) for number, _ in grouped)
            # yield the sum as a numeric string, then the letter
            yield from (str(total), letter)
        else:
            # add the (number, "D") as separate elements
            for number, letter in grouped:
                yield from (number, letter)

You can then use list(sum_m_values(...)) to get a list again, or just use the generator as-is. For long inputs, that could be the preferred option as that means you never need to keep everything in memory all at once.
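To tie this back to the CIGAR string in the question, the generator can be fed straight into str.join(). The following is only a sketch: `soft_clips_to_matches` is a hypothetical helper (not part of pysam), the regular expression assumes single-character operations, and it only rewrites the string. Updating the read record itself (for example its start position) is outside this sketch.

```python
import re
from itertools import groupby
from operator import itemgetter

def sum_m_values(values):
    """Yield numbers and letters, merging each run of consecutive M entries."""
    it = iter(values)
    paired = zip(it, it)  # ('20', 'M'), ('10', 'M'), ...
    for letter, grouped in groupby(paired, itemgetter(1)):
        if letter == "M":
            yield str(sum(int(number) for number, _ in grouped))
            yield letter
        else:
            for pair in grouped:
                yield from pair

def soft_clips_to_matches(cigar):
    # hypothetical helper: split into numbers and single-letter operations
    tokens = re.findall(r"\d+|[A-Za-z=]", cigar)
    # relabel soft clips (S) as matches (M) before merging
    tokens = ["M" if token == "S" else token for token in tokens]
    # join() consumes the generator directly, no intermediate list needed
    return "".join(sum_m_values(tokens))

print(soft_clips_to_matches("20S10M1D14M106S"))  # 30M1D120M
print(soft_clips_to_matches("12S138M"))          # 150M
```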

CodePudding user response:

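A solution using the itertools module (note that it sums every run of equal letters, so consecutive D entries would be merged as well):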

>>> from itertools import groupby, chain
>>> records = [
...     ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M'],
...     ['124', 'M', '19', 'M', '7', 'M'],
...     ['19', 'M', '131', 'M'],
...     ['3', 'M', '19', 'M', '128', 'M'],
...     ['12', 'M', '138', 'M'],
... ]
>>> res = []
>>> for rec in records:
...     res.append(list(
...         chain.from_iterable(
...             map(
...                 lambda x: (
...                     str(sum(map(lambda y: y[0], x[1]))),
...                     x[0],
...                 ),
...                 groupby(
...                     zip(map(int, rec[::2]), rec[1::2]),
...                     lambda k: k[1]
...                 )
...             )
...         )
...     ))
...
>>> res
[['30', 'M', '1', 'D', '120', 'M'], ['150', 'M'], ['150', 'M'], ['150', 'M'], ['150', 'M']]

CodePudding user response:

Suppose you have that list of lists as input:

LoL=[
    ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M'],
    ['20', 'M', '10', 'M', '1', 'D', '2', 'D', '14', 'M', '106', 'M'],
    ['124', 'M', '19', 'M', '7', 'M'],
    ['19', 'M', '131', 'M'],
    ['3', 'M', '19', 'M', '128', 'M'],
    ['12', 'M', '138', 'M'],
]

If you want to sum consecutive values of M (in each sub list) you can use groupby from itertools to step through the list by the second element:

from itertools import groupby


for l in LoL:
    result=[]
    for k, v in groupby((l[x:x+2] for x in range(0,len(l),2)),
                        key=lambda l: l[1]):
        result.extend([sum(int(l[0]) for l in v), k])
    print(result)

Prints:

[30, 'M', 1, 'D', 120, 'M']
[30, 'M', 3, 'D', 120, 'M']
[150, 'M']
[150, 'M']
[150, 'M']
[150, 'M']

If you only want to sum 'M' entries, just test k:

for l in LoL:
    result=[]
    for k, v in groupby((l[x:x+2] for x in range(0,len(l),2)),
                        key=lambda l: l[1]):
        if k=='M':
            result.extend([sum(int(l[0]) for l in v), k])
        else:
            for e in v:
                result.extend([e[0], k])
    print(result)

Prints:

[30, 'M', '1', 'D', 120, 'M']
[30, 'M', '1', 'D', '2', 'D', 120, 'M']
[150, 'M']
[150, 'M']
[150, 'M']
[150, 'M']
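The output above mixes integer sums with string counts. If the result should consist of strings only, as in the question, one possible variant wraps the sum in str(). This is a sketch under that assumption; `merge_m_runs` is just an illustrative name.

```python
from itertools import groupby

def merge_m_runs(tokens):
    # step through the list two items at a time: [number, letter]
    pairs = (tokens[i:i + 2] for i in range(0, len(tokens), 2))
    result = []
    for op, group in groupby(pairs, key=lambda pair: pair[1]):
        if op == 'M':
            # sum the run of M counts and store the total as a string
            result.extend([str(sum(int(n) for n, _ in group)), op])
        else:
            # leave every non-M entry untouched
            for n, _ in group:
                result.extend([n, op])
    return result

print(merge_m_runs(['20', 'M', '10', 'M', '1', 'D', '2', 'D', '14', 'M', '106', 'M']))
# ['30', 'M', '1', 'D', '2', 'D', '120', 'M']
```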