Add Two Column Values in DataFrame and name it with its Prefix Name-CodePudding

MyData looks like

ID, kgp11274425_M, kgp11274425_F, kgp5732633_M, kgp5732633_F, rs7077812_M, rs7077812_F
1       C                T            G             T              C            A
2       C                C            T             G              A            A
3       A                A            G             G              C            G
4       G                G            C             C              A            A

Note: Like above I have 522 rows and 369 Columns ( of individual Mother & Father SNP values) which I need to combine

I'm working with GWAS data, these were the SNPs id of our cell containing One Mother Chromosome and Father Chromosome. I would like to combine individual SNP values of M & F into One and Name it with SNP ID (kgp11274425_M kgp11274425_F = kgp11274425)

Desired Output:
ID, kgp11274425, kgp5732633, rs7077812
1      CT           GT           CA
2      CC           TG           AA
3      AA           GG           CG
4      GG           CC           AA

Can Anyone please help me with Python Code, all support & help needed

CodePudding user response：

Here you go:

import pandas as pd
df = pd.DataFrame({
    'ID' : [1,2,3,4],
    'kgp11274425_M' : ['C','C','A','G'], 
    'kgp11274425_F' : ['T','C','A','G'],
    'kgp5732633_M' : ['G','T','G','C'], 
    'kgp5732633_F' : ['T','G','G','C'], 
    'rs7077812_M' : ['C','A','C','A'], 
    'rs7077812_F' : ['A','A','G','A']
})
print(df)
from collections import defaultdict
data = defaultdict(list)
def mergeMF(row):
    data['ID']  = [row['ID']]
    cols = defaultdict(list)
    for colName in row.index:
        if colName == 'ID':
            continue
        newColName = colName[:-2]
        isMale = colName[-1] == 'M'
        cols[newColName] = [row[colName] if isMale else (cols[newColName][0] if len(cols[newColName]) else None), row[colName] if not isMale else (cols[newColName][1] if len(cols[newColName]) else None)]
    for k, v in cols.items():
        data[k]  = [''.join(v)]

df.apply(mergeMF, axis = 1)
df2 = pd.DataFrame(data)
print(df2)

Output

   ID kgp11274425_M kgp11274425_F kgp5732633_M kgp5732633_F rs7077812_M rs7077812_F
0   1             C             T            G            T           C           A
1   2             C             C            T            G           A           A
2   3             A             A            G            G           C           G
3   4             G             G            C            C           A           A
   ID kgp11274425 kgp5732633 rs7077812
0   1          CT         GT        CA
1   2          CC         TG        AA
2   3          AA         GG        CG
3   4          GG         CC        AA

CodePudding user response：

A simplified example:

data = StringIO("""ID, kgp11274425_M, kgp11274425_F, kgp5732633_M, kgp5732633_F,
1,       C,                T,            G,             T,                
2,       C,                C,            T,             G,              
3,       A,                A,            G,             G,              
4,       G,                G,            C,             C""")
df = pd.read_csv(data, sep=",")

cols = ['ID', ' kgp11274425_M', ' kgp11274425_F', ' kgp5732633_M',
       ' kgp5732633_F']

df = df[cols]
df = df.set_index('ID')

unique_cols = pd.Index(map(lambda x : str(x)[:-2], df.columns)).unique()

results = []
columns = []
for col in unique_cols:
    my_cols = [x for x in df.columns if x.startswith(col)]
    results.append(df[my_cols].sum(axis=1).values)
    columns.append(col)

new_df = pd.DataFrame(results).T
new_df.columns = columns

This returns:

    kgp11274425 kgp5732633
0   CT          GT
1   CC          TG
2   AA          GG
3   GG          CC

CodePudding user response：

Split your columns into a multiindex: (kgp11274425, M), (kgp11274425, F), (kgp5732633, M), (kgp5732633, F), ... then combine the columns together:

result = df.set_index("ID")
result.columns = pd.MultiIndex.from_tuples(result.columns.to_series().str.split("_").apply(tuple))
result = result.xs("M", axis=1, level=1)   result.xs("F", axis=1, level=1)