MyData looks like
ID, kgp11274425_M, kgp11274425_F, kgp5732633_M, kgp5732633_F, rs7077812_M, rs7077812_F
1 C T G T C A
2 C C T G A A
3 A A G G C G
4 G G C C A A
Note: Like above I have 522 rows and 369 Columns ( of individual Mother & Father SNP values) which I need to combine
I'm working with GWAS data, these were the SNPs id of our cell containing One Mother Chromosome and Father Chromosome. I would like to combine individual SNP values of M & F into One and Name it with SNP ID (kgp11274425_M kgp11274425_F = kgp11274425)
Desired Output:
ID, kgp11274425, kgp5732633, rs7077812
1 CT GT CA
2 CC TG AA
3 AA GG CG
4 GG CC AA
Can Anyone please help me with Python Code, all support & help needed
CodePudding user response:
Here you go:
import pandas as pd
df = pd.DataFrame({
'ID' : [1,2,3,4],
'kgp11274425_M' : ['C','C','A','G'],
'kgp11274425_F' : ['T','C','A','G'],
'kgp5732633_M' : ['G','T','G','C'],
'kgp5732633_F' : ['T','G','G','C'],
'rs7077812_M' : ['C','A','C','A'],
'rs7077812_F' : ['A','A','G','A']
})
print(df)
from collections import defaultdict
data = defaultdict(list)
def mergeMF(row):
data['ID'] = [row['ID']]
cols = defaultdict(list)
for colName in row.index:
if colName == 'ID':
continue
newColName = colName[:-2]
isMale = colName[-1] == 'M'
cols[newColName] = [row[colName] if isMale else (cols[newColName][0] if len(cols[newColName]) else None), row[colName] if not isMale else (cols[newColName][1] if len(cols[newColName]) else None)]
for k, v in cols.items():
data[k] = [''.join(v)]
df.apply(mergeMF, axis = 1)
df2 = pd.DataFrame(data)
print(df2)
Output
ID kgp11274425_M kgp11274425_F kgp5732633_M kgp5732633_F rs7077812_M rs7077812_F
0 1 C T G T C A
1 2 C C T G A A
2 3 A A G G C G
3 4 G G C C A A
ID kgp11274425 kgp5732633 rs7077812
0 1 CT GT CA
1 2 CC TG AA
2 3 AA GG CG
3 4 GG CC AA
CodePudding user response:
A simplified example:
data = StringIO("""ID, kgp11274425_M, kgp11274425_F, kgp5732633_M, kgp5732633_F,
1, C, T, G, T,
2, C, C, T, G,
3, A, A, G, G,
4, G, G, C, C""")
df = pd.read_csv(data, sep=",")
cols = ['ID', ' kgp11274425_M', ' kgp11274425_F', ' kgp5732633_M',
' kgp5732633_F']
df = df[cols]
df = df.set_index('ID')
unique_cols = pd.Index(map(lambda x : str(x)[:-2], df.columns)).unique()
results = []
columns = []
for col in unique_cols:
my_cols = [x for x in df.columns if x.startswith(col)]
results.append(df[my_cols].sum(axis=1).values)
columns.append(col)
new_df = pd.DataFrame(results).T
new_df.columns = columns
This returns:
kgp11274425 kgp5732633
0 CT GT
1 CC TG
2 AA GG
3 GG CC
CodePudding user response:
Split your columns into a multiindex: (kgp11274425, M), (kgp11274425, F), (kgp5732633, M), (kgp5732633, F), ...
then combine the columns together:
result = df.set_index("ID")
result.columns = pd.MultiIndex.from_tuples(result.columns.to_series().str.split("_").apply(tuple))
result = result.xs("M", axis=1, level=1) result.xs("F", axis=1, level=1)