The original Data looks like
ID, kgp11274425_A, kgp11274425_HET, kgp5732633_C, kgp5732633_HET, rs707_G, rs707_HET, kgp75_T, kgp75_HET
1 C T G T C A 0 0
2 C C T G A A G T
3 A A G G C G A A
4 G G C C A A T A
Note:
- Like above I have 522 rows and 369 Columns ( of individual Mother & Father SNP values) which I need to combine
- The length of each SNP is different (ex: kgp11274425 & rs707)
I'm working with GWAS data, these were the SNPs id of our cell containing One Mother Chromosome and Father Chromosome. I would like to combine individual SNP values of M & F into One and Name it with SNP ID (kgp11274425_A kgp11274425_HET = kgp11274425)
Desired Output:
ID, kgp11274425 kgp5732633 rs707 kgp75
1 CT GT CA 00
2 CC TG AA GT
3 AA GG CG AA
4 GG CC AA TA
Can Anyone please help me, all support & help needed
CodePudding user response:
Just change the unique_cols
in the following way:
from io import StringIO
data = StringIO("""ID, kgp11274425_M, kgp11274425_F, kgp5732633_M, kgp5732633_F,
1, C, T, G, T,
2, C, C, T, G,
3, A, A, G, G,
4, G, G, C, C""")
df = pd.read_csv(data, sep=",")
cols = ['ID', ' kgp11274425_M', ' kgp11274425_F', ' kgp5732633_M',
' kgp5732633_F']
df = df[cols]
df = df.set_index('ID')
#here######################################################
sep = '_'
unique_cols = pd.Index(map(lambda x : x.split(sep, 1)[0], df.columns)).unique()
#here######################################################
results = []
columns = []
for col in unique_cols:
my_cols = [x for x in df.columns if x.startswith(col)]
results.append(df[my_cols].sum(axis=1).values)
columns.append(col)
new_df = pd.DataFrame(results).T
new_df.columns = columns
CodePudding user response:
One option is with pandas' groupby, followed by an iteration:
temp = df.set_index('ID')
wrapper = temp.groupby(temp.columns.str.split('_').str[0], axis = 1)
wrapper.sum().reset_index()
# if you want to use a delimiter, you can try
# wrapper = wrapper.apply(lambda x: x.iloc[:, 0].str.cat(x.iloc[:, 1:]), sep=',')
ID kgp11274425 kgp5732633 kgp75 rs707
0 1 CT GT 00 CA
1 2 CC TG GT AA
2 3 AA GG AA CG
3 4 GG CC TA AA