I want to create a new dataframe with columns calculated as means of columns with similar names from this dataframe:
B6_i B6_ii B6_iii ... BXD80_i BXD80_ii BXD81_i
data ...
Cd38 0.598864 -0.225322 0.306926 ... -0.312190 0.281429 0.424752
Trim21 1.947399 2.920681 2.805861 ... 1.469634 2.103585 0.827487
Kpnb1 -0.458240 -0.417507 -0.441522 ... -0.314313 -0.153509 -0.095863
Six1 1.055255 0.868148 1.012298 ... 0.142565 0.264753 0.807692
The new dataframe should look like this:
B6 BXD80 ... BXD81
data
Cd38 -0.041416 -0.087859 ... 0.424752
Trim21 15.958981 3.091500 ... 0.827487
Kpnb1 -0.084471 0.048250 ... -0.095863
Six1 0.927383 0.037745 ... 0.807692
(like (B6_i B6_ii B6_iii)/3), based on all characters until the underscore "_")
Some columns are one of n columns, and others are singular (like 'BXD81_i'), so I need a method that can work with a varying number for each mean calculation.
Sorry for writing so long, many thanks in advance ! :)
CodePudding user response:
You can aggregate mean
per columns by values before _
:
df.columns = df.columns.str.split('_', expand=True)
df1 = df.groupby(level=0, axis=1).mean()
Or:
df1 = df.groupby(lambda x: x.split('_')[0], axis=1).mean()
print (df1)
B6 BXD80 BXD81
data
Cd38 0.226823 -0.015381 0.424752
Trim21 2.557980 1.786609 0.827487
Kpnb1 -0.439090 -0.233911 -0.095863
Six1 0.978567 0.203659 0.807692