I have a dataframe where I have a series of repeating columns. How can I just combine them so that they are just one set of individual columns? I have tried using df.melt but I need to specify columns in one of the parameters. The dataframe has an a lot of columns so typing them out individually would not work. It seems like this is a simple fix but I just cannot figure it out. Can anyone help? A sample dataframe is below, I added a period before or after the ch in the column name so it can let me recreate the smaller version of my dataframe.
df3= pd.DataFrame({
'Label': {'1':'E10_1_nucleus' ,'2':'E10_1_cytoplasm','3':'E11_1_nucleus' ,'4':'E11_1_cytoplasm'},
'Area_ch1.': {'1': 435,'2':635,'3': 105,'4':850},
'Area_ch1': {'1': 135,'2':605,'3': 158,'4':970},
'Mean_ch2': {'1': 313,'2':847,'3': 315,'4':850},
'Mean_ch2.': {'1': 150,'2':331,'3': 195,'4':130}})
Desired output:
df4= pd.DataFrame({
'Label': {'1':'E10_1_nucleus' ,'2':'E10_1_cytoplasm','3':'E11_1_nucleus' ,'4':'E11_1_cytoplasm','5':'E10_1_nucleus' ,'6':'E10_1_cytoplasm','7':'E11_1_nucleus' ,'8':'E11_1_cytoplasm'},
'Area_ch1': {'1': 435,'2':635,'3': 105,'4':850,'5': 135,'6':605,'7': 158,'8':970},
'Mean_ch2': {'1': 313,'2':847,'3': 315,'4':850,'5': 150,'6':331,'7': 195,'8':130}})
CodePudding user response:
Try this:
import difflib
import pandas as pd
df3 = pd.DataFrame(
{
"Label": {
"1": "E10_1_nucleus",
"2": "E10_1_cytoplasm",
"3": "E11_1_nucleus",
"4": "E11_1_cytoplasm",
},
"Area_.ch1": {"1": 435, "2": 635, "3": 105, "4": 850},
"Area_ch1.": {"1": 275, "2": 600, "3": 175, "4": 300},
"Area_ch1": {"1": 135, "2": 605, "3": 158, "4": 970},
"Mean": {"1": 313, "2": 847, "3": 315, "4": 850},
"StdDev": {"1": 150, "2": 331, "3": 195, "4": 130},
}
)
for column in df3.columns:
matches = difflib.get_close_matches(column, df3.columns, n=df3.shape[1])
if len(matches) > 1:
df3 = pd.concat(
[
df3[[*df3.columns.difference(matches), repeated_col]].rename(
columns={repeated_col: matches[0]}
)
for repeated_col in matches
]
)
df3
# Returns:
#
# Label Mean StdDev Area_.ch1
# 1 E10_1_nucleus 313 150 435
# 2 E10_1_cytoplasm 847 331 635
# 3 E11_1_nucleus 315 195 105
# 4 E11_1_cytoplasm 850 130 850
# 1 E10_1_nucleus 313 150 135
# 2 E10_1_cytoplasm 847 331 605
# 3 E11_1_nucleus 315 195 158
# 4 E11_1_cytoplasm 850 130 970
# 1 E10_1_nucleus 313 150 275
# 2 E10_1_cytoplasm 847 331 600
# 3 E11_1_nucleus 315 195 175
# 4 E11_1_cytoplasm 850 130 300
Notes
difflib.get_close_matches
will try to find values similar to the column name. It has a parameter named cutoff
that accepts values between 0 and 1 that you can use to control how strict this match needs to be. In other words, increase this parameter’s value, to decrease the chance of different column names being matched.
CodePudding user response:
Let's try pd.wide_to_long
after renaming your df3
column header properly
out = (pd.wide_to_long(df3.rename(columns={'Area_ch1.': 'Area_ch2', 'Mean_ch2.': 'Mean_ch1'}),
['Area_ch', 'Mean_ch'], i='Label', j='test')
.droplevel(level=-1)
.reset_index())
print(out)
Label Area_ch Mean_ch
0 E10_1_nucleus 435 313
1 E10_1_cytoplasm 635 847
2 E11_1_nucleus 105 315
3 E11_1_cytoplasm 850 850
4 E10_1_nucleus 135 150
5 E10_1_cytoplasm 605 331
6 E11_1_nucleus 158 195
7 E11_1_cytoplasm 970 130