I have a large dataframe which features mutations in the format: "R68S, M90V, Y227A, F327A", etc. where the letters represent single letter abbreviations for amino acids, and the numbers represent the location of these mutations within the genome.
Minimal, reproducible example:
import pandas as pd
data = [[31581, "wild-type"], [31614, "D250C,E296C"], [31731, "T112K,T116I,E324I,S150C,N157C,V189C,D332C"]]
df = pd.DataFrame(data, columns=['SAMPLE', 'MUTATION'])
df
My code is as follows:
# Drop wild-type rows; .copy() avoids a SettingWithCopyWarning on the later assignment
df2 = df[~df["MUTATION"].str.contains("wild-type")].copy()
df2["MUTATION"] = df2["MUTATION"].str.strip('(Based on UniProt and PDB), (Based on PDB), (Based on UniProt), (Based on Paper)')
filtered = df2["MUTATION"].str.split('/|;|,| |:')
filtered = df2["MUTATION"].str.split(r'(\d+)')
for m_item in filtered:
    if len(m_item) >= 9:
        print(m_item)
This is how I cleaned up the data. Splitting the numbers from the letters yields a list like: ['R', '68', 'S M', '90', 'V Y', '227', 'A F', '327', 'A']. I want to know how far apart these mutations are by producing a list of their distances, so for the above example I will need (327 - 227), (227 - 90), and (90 - 68). There are over 30,000 rows like this in my dataframe, so I cannot do this by hand. I am new to Python, and any help is greatly appreciated!
CodePudding user response:
Let us first find all the mutation positions in each row, then map
a lambda function that calculates the distance between consecutive positions:
s = df['MUTATION'].str.findall(r'\b[A-Z](\d+)[A-Z]\b')
df["DISTANCE"] = s.map(lambda l: [int(a) - int(b) for a, b in zip(l[1:], l[:-1])])
   SAMPLE                                   MUTATION                    DISTANCE
0   31581                                  wild-type                          []
1   31614                                D250C,E296C                        [46]
2   31731  T112K,T116I,E324I,S150C,N157C,V189C,D332C  [4, 208, -174, 7, 32, 143]
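Note that the distances above follow the order in which the mutations are listed, which is why the last row contains a negative value (150 - 324 = -174). If you want the gaps between neighbouring positions along the sequence instead, one option is to sort the positions first. Below is a minimal sketch of that variant; it assumes sorted order is what you want, and the helper name `sorted_gaps` and the column name `SORTED_DISTANCE` are made up for illustration.

```python
import pandas as pd

data = [
    [31581, "wild-type"],
    [31614, "D250C,E296C"],
    [31731, "T112K,T116I,E324I,S150C,N157C,V189C,D332C"],
]
df = pd.DataFrame(data, columns=["SAMPLE", "MUTATION"])

# Capture the numeric position from each token such as "D250C".
# Rows without any match (e.g. "wild-type") give an empty list.
positions = df["MUTATION"].str.findall(r"\b[A-Z](\d+)[A-Z]\b")


def sorted_gaps(found):
    """Hypothetical helper: sort the positions, then take consecutive differences."""
    nums = sorted(int(p) for p in found)
    return [b - a for a, b in zip(nums, nums[1:])]


df["SORTED_DISTANCE"] = positions.map(sorted_gaps)
print(df["SORTED_DISTANCE"].tolist())
# [[], [46], [4, 34, 7, 32, 135, 8]]
```

With sorting, every distance is non-negative, and rows with zero or one mutation still produce an empty list.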