How do I measure the difference between two values within each row of my dataframe if they are separ

I have a large dataframe which features mutations in the format: "R68S, M90V, Y227A, F327A", etc. where the letters represent single letter abbreviations for amino acids, and the numbers represent the location of these mutations within the genome.

Minimal, reproducible example:

import pandas as pd 
data = [[31581, "wild-type"], [31614, "D250C,E296C"], [31731, "T112K,T116I,E324I,S150C,N157C,V189C,D332C"]]
df = pd.DataFrame(data, columns=['SAMPLE', 'MUTATION'])
df

My code is as follows:

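# Drop wild-type rows; .copy() avoids a SettingWithCopyWarning on the next assignment
df2 = df[~df["MUTATION"].str.contains("wild-type")].copy()
# Remove annotations such as "(Based on PDB)"; str.strip would strip individual characters instead
df2["MUTATION"] = df2["MUTATION"].str.replace(r'\s*\(Based on [^)]*\)', '', regex=True)
filtered = df2["MUTATION"].str.split(r'/|;|,| |:')
# Split on runs of digits; the capture group keeps the digits in the result
filtered = df2["MUTATION"].str.split(r'(\d+)')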

for m_item in filtered:
    if len(m_item) >= 9:
        print(m_item)

This is how I cleaned up the data. The new format, which separates the numbers from the letters, yields: ['R', '68', 'S M', '90', 'V Y', '227', 'A F', '327', 'A']. I want to know how far apart these mutations are by producing a list of their distances, so for the above example I need (327 - 227), (227 - 90), and (90 - 68). There are over 30,000 rows like this in my dataframe, so I cannot use a shortcut method. I am new to Python, and any help is greatly appreciated!

CodePudding user response:

Let us first find all the mutation locations in each row, then map a lambda function that calculates the distance between consecutive locations:

s = df['MUTATION'].str.findall(r'\b[A-Z](\d+)[A-Z]\b')
df["DISTANCE"] = s.map(lambda l: [int(a) - int(b) for a, b in zip(l[1:], l[:-1])])

   SAMPLE                                   MUTATION                    DISTANCE
0   31581                                  wild-type                          []
1   31614                                D250C,E296C                        [46]
2   31731  T112K,T116I,E324I,S150C,N157C,V189C,D332C  [4, 208, -174, 7, 32, 143]
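
Note that DISTANCE follows the order in which the mutations are listed, so a gap can be negative (for example -174 in the last row). If what is wanted is the spacing between neighbouring positions along the sequence, one option is to sort the positions before taking differences. The following is only a minimal sketch under that assumption; `MUTATION_RE` and `sorted_gaps` are names made up for illustration and are not part of pandas.

```python
import re

# Matches one mutation such as "D250C" and captures its numeric position.
MUTATION_RE = re.compile(r"\b[A-Z](\d+)[A-Z]\b")


def sorted_gaps(mutations):
    """Gaps between neighbouring mutation positions, after sorting them.

    Hypothetical helper: returns an empty list when fewer than two
    mutations are found (for example for "wild-type").
    """
    ordered = sorted(int(pos) for pos in MUTATION_RE.findall(mutations))
    return [b - a for a, b in zip(ordered, ordered[1:])]


rows = ["wild-type", "D250C,E296C", "T112K,T116I,E324I,S150C,N157C,V189C,D332C"]
for row in rows:
    print(row, sorted_gaps(row))
```

On the dataframe from the question, `df["MUTATION"].map(sorted_gaps)` should apply the same helper row by row. Because the pattern only matches letter-digits-letter tokens, trailing annotations such as "(Based on PDB)" are skipped without a separate cleaning step.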