Home > Blockchain >  Find consecutive series in list of tuples in python
Find consecutive series in list of tuples in python

Time:11-23

I am struggling with the following issue: I would like to write some small code to deisotope mass spec data.

For this, I compare, if the difference between two signals is equal the mass of a proton devided by the charge state. So far, so easy.

I am struggling now, to find series of more than two peaks.

I broke down the problem to having a list of tuples, and a series are n tuples, where the last number of the previous tuple is equal the first tuple of the current tuple.

From this:

[(1,2), (2,3), (4,5), (7,9), (8,10), (9,11)]

To this:

[(1,2,3), (4,5), (7,9,11), (8,10)]

Simple order will fail, as there might be jumps (7-->9) and an intermediate signal (8,10)

Here is some test data:

import numpy as np

proton = 1.0078250319 

data = [
 (632.3197631835938, 2244.3374),  #0
 (632.830322265625, 2938.797),    #1
 (634.3308715820312, 1567.1385),  #2
 (639.3309326171875, 80601.41),   #3
 (640.3339233398438, 23759.367),  #4
 (641.3328247070312, 4771.9946),  #5
 (641.8309326171875, 2735.902),   #6
 (642.3365478515625, 4600.567),   #7
 (642.84033203125, 1311.657),     #8
 (650.34521484375, 11952.237),    #9
 (650.5, 1),                      #10 
 (650.84228515625, 10757.939),    #11
 (651.350341796875, 6324.9023),   #12
 (651.8455200195312, 1398.8452),  #13
 (654.296875, 1695.3457)]         #14

mz, i = zip(*data)

mz = np.array(mz)
i = np.array(i)

arr = np.triu(mz - mz[:, np.newaxis])

charge = 2

So actually, in the first step, I am just interested in the mz values. I substract all values from all values and isolate the upper triangle.

To calculate, if two signals are actually within the correct mass, I then use the following code:

>>> pairs = tuple(zip(*np.where(abs(arr - (proton / charge)) < 0.01)))
((0, 1), (5, 6), (6, 7), (7, 8), (9, 11), (11, 12), (12, 13))

Now, to corresponding signals are clear by eye:

Peak 1: 0 to 1

Peak 2: 5 to 8

Peak 3: 9 to 13, without 10.

So in principle, I would like compare the 2nd value of each tuple with the first tuple of any other, to identify consequtive sequnces.

What I tried, is to flatten the list, remove duplicates and find consequtive counting in this 1D list. But this fails, as a peak from 5-9 is found.

I would like to have a vectorized solution, as this calculation is done for 100-500 signals for multiple charge states in 30000 spectra.

I am pretty sure, this had been asked before, but was not able to find a suitable solution.

Eventually, these series are than used to check the intensity of the corresponding peaks, sum them and use the biggest initial value to assign the deisotoped peak here.

Thx Christian

ps. also if there are some suggestion to the already existing code, I am happy to learn. I am pretty new to vectorized calculation, and usually wrote tons of for loops, which take ages to finish.

CodePudding user response:

In graph theory your problem will be "How to find all disconnected subgraph in a graph ?".

So why not using a network analysis library such as networkx:

import networkx as nx
# Your tuples become the edges of the graph.
edge = [(1,2), (2,3), (4,5), (7,9), (8,10), (9,11)]

# We create the graph
G = nx.Graph()
G.add_edges_from(edge)

# Use connected_components to detect the subgraphs.
d = list(nx.connected_components(G)) 

And we obtain as expected:

[{1, 2, 3}, {4, 5}, {7, 9, 11}, {8, 10}]

With 4 subgraphs:

enter image description here

CodePudding user response:

Try:

d = {}
pairs = [(1,2), (2,3), (4,5), (7,9), (8,10), (9,11)]
for t in pairs:
    if t[0] in d:
       d[t[0]].append(t[1])
       d[t[1]] = d.pop(t[0])
    else:
        d[t[1]] = list(t)
signals = tuple(tuple(v) for v in d.values())

Though the solution isn't vectorized it will be O(n) time. Note that this solution only works if you pairs are sorted.

CodePudding user response:

You can also try this, unfortunately not O(n) solution, but O(nlog(n)) or O(n*n) due to double cycle:

#!/usr/bin/env ipython
# --------------------
datain = [(1,2), (2,3), (4,5), (7,9), (8,10), (9,11)]
dataout = [];
# ---------------------------------------------------
for ii,val_a in enumerate(datain):
    print(val_a)
    appendval = set(val_a)
    for jj,val_b in enumerate(datain[ii 1:]):
        # -------------------------------------------
        if val_a[1] == val_b[0]:
            appendval = appendval.union(val_b)
            datain.remove(val_b)
        # -------------------------------------------
    dataout.append(tuple(appendval))
# ---------------------------------------------------
  • Related