How to Add Two Columns of DataFrame and Rename it with Prefix Name-CodePudding

The original Data looks like

ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1       C                T            G             T              C            A       0          0
2       C                C            T             G              A            A       G          T
3       A                A            G             G              C            G       A          A
4       G                G            C             C              A            A       T          A

Note:

Like above I have 522 rows and 369 Columns ( of individual Mother & Father SNP values) which I need to combine
The length of each SNP is different (ex: kgp11274425 & rs707)

I'm working with GWAS data, these were the SNPs id of our cell containing One Mother Chromosome and Father Chromosome. I would like to combine individual SNP values of M & F into One and Name it with SNP ID (kgp11274425_A kgp11274425_HET = kgp11274425)

Desired Output:

ID, kgp11274425  kgp5732633    rs707     kgp75
1      CT           GT           CA       00
2      CC           TG           AA       GT
3      AA           GG           CG       AA
4      GG           CC           AA       TA


Can Anyone please help me, all support & help needed

CodePudding user response：

Just change the unique_cols in the following way:

from io import StringIO
data = StringIO("""ID, kgp11274425_M, kgp11274425_F, kgp5732633_M, kgp5732633_F,
1,       C,                T,            G,             T,                
2,       C,                C,            T,             G,              
3,       A,                A,            G,             G,              
4,       G,                G,            C,             C""")
df = pd.read_csv(data, sep=",")

cols = ['ID', ' kgp11274425_M', ' kgp11274425_F', ' kgp5732633_M',
       ' kgp5732633_F']

df = df[cols]
df = df.set_index('ID')
#here######################################################
sep = '_'
unique_cols = pd.Index(map(lambda x : x.split(sep, 1)[0], df.columns)).unique() 
#here######################################################

results = []
columns = []
for col in unique_cols:
    my_cols = [x for x in df.columns if x.startswith(col)]
    results.append(df[my_cols].sum(axis=1).values)
    columns.append(col)

new_df = pd.DataFrame(results).T
new_df.columns = columns

CodePudding user response：

One option is with pandas' groupby, followed by an iteration:

temp = df.set_index('ID')
wrapper = temp.groupby(temp.columns.str.split('_').str[0], axis = 1)
wrapper.sum().reset_index()
# if you want to use a delimiter, you can try
# wrapper = wrapper.apply(lambda x: x.iloc[:, 0].str.cat(x.iloc[:, 1:]), sep=',')

   ID kgp11274425 kgp5732633 kgp75 rs707
0   1          CT         GT    00    CA
1   2          CC         TG    GT    AA
2   3          AA         GG    AA    CG
3   4          GG         CC    TA    AA