Combining repeating columns in a dataframe

I have a dataframe with a series of repeating columns. How can I combine them so that there is just one set of individual columns, with the repeated values stacked underneath? I have tried df.melt, but it needs the columns to be specified in one of its parameters, and the dataframe has so many columns that typing them out individually is not practical. It seems like this should be a simple fix, but I cannot figure it out. Can anyone help? A sample dataframe is below. I added a period after the "ch" part of some column names only so that I could recreate a smaller version of my dataframe.

df3= pd.DataFrame({
'Label': {'1':'E10_1_nucleus' ,'2':'E10_1_cytoplasm','3':'E11_1_nucleus' ,'4':'E11_1_cytoplasm'},
'Area_ch1.': {'1': 435,'2':635,'3': 105,'4':850},
'Area_ch1': {'1': 135,'2':605,'3': 158,'4':970},
'Mean_ch2': {'1': 313,'2':847,'3': 315,'4':850},
'Mean_ch2.': {'1': 150,'2':331,'3': 195,'4':130}})

Desired output:

df4= pd.DataFrame({
'Label': {'1':'E10_1_nucleus' ,'2':'E10_1_cytoplasm','3':'E11_1_nucleus' ,'4':'E11_1_cytoplasm','5':'E10_1_nucleus' ,'6':'E10_1_cytoplasm','7':'E11_1_nucleus' ,'8':'E11_1_cytoplasm'},
'Area_ch1': {'1': 435,'2':635,'3': 105,'4':850,'5': 135,'6':605,'7': 158,'8':970}, 
'Mean_ch2': {'1': 313,'2':847,'3': 315,'4':850,'5': 150,'6':331,'7': 195,'8':130}})

CodePudding user response：

Try this:


import difflib
import pandas as pd


df3 = pd.DataFrame(
    {
        "Label": {
            "1": "E10_1_nucleus",
            "2": "E10_1_cytoplasm",
            "3": "E11_1_nucleus",
            "4": "E11_1_cytoplasm",
        },
        "Area_.ch1": {"1": 435, "2": 635, "3": 105, "4": 850},
        "Area_ch1.": {"1": 275, "2": 600, "3": 175, "4": 300},
        "Area_ch1": {"1": 135, "2": 605, "3": 158, "4": 970},
        "Mean": {"1": 313, "2": 847, "3": 315, "4": 850},
        "StdDev": {"1": 150, "2": 331, "3": 195, "4": 130},
    }
)


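# Loop over a snapshot of the original column names, because df3 is
# reassigned inside the loop.
for column in list(df3.columns):
    # Every current column whose name is similar to this one
    # (get_close_matches uses its default cutoff of 0.6 here).
    matches = difflib.get_close_matches(column, df3.columns, n=df3.shape[1])
    if len(matches) > 1:
        # Stack one copy of the frame per similar column, each renamed to the
        # best match (matches[0]), keeping the non-matching columns alongside.
        df3 = pd.concat(
            [
                df3[[*df3.columns.difference(matches), repeated_col]].rename(
                    columns={repeated_col: matches[0]}
                )
                for repeated_col in matches
            ]
        )
df3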
# Returns:
#
#              Label  Mean  StdDev  Area_.ch1
# 1    E10_1_nucleus   313     150        435
# 2  E10_1_cytoplasm   847     331        635
# 3    E11_1_nucleus   315     195        105
# 4  E11_1_cytoplasm   850     130        850
# 1    E10_1_nucleus   313     150        135
# 2  E10_1_cytoplasm   847     331        605
# 3    E11_1_nucleus   315     195        158
# 4  E11_1_cytoplasm   850     130        970
# 1    E10_1_nucleus   313     150        275
# 2  E10_1_cytoplasm   847     331        600
# 3    E11_1_nucleus   315     195        175
# 4  E11_1_cytoplasm   850     130        300

Notes

difflib.get_close_matches returns the entries of a list that are similar to a given string, here the column name. Its cutoff parameter accepts a value between 0 and 1 (the default is 0.6) and controls how strict the match needs to be. In other words, increase this parameter’s value to decrease the chance of different column names being matched.
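
To see why cutoff matters, here is a small sketch using the column names from the question's sample frame rather than the ones above. With the default cutoff, Area_ch1 is also considered close to Mean_ch2, because the two names share the "_ch" part, so the loop above would stack those two as well. The value 0.9 is only an illustrative choice, not a recommendation.

```python
import difflib

# Column names taken from the sample frame in the question.
columns = ["Label", "Area_ch1.", "Area_ch1", "Mean_ch2", "Mean_ch2."]

# Default cutoff (0.6): loosely similar names are matched too.
loose = difflib.get_close_matches("Area_ch1", columns, n=len(columns))

# Stricter cutoff (0.9, chosen for illustration): only near-identical names remain.
strict = difflib.get_close_matches("Area_ch1", columns, n=len(columns), cutoff=0.9)

print(loose)
print(strict)
```

If the real column names differ only by a stray character, a high cutoff is probably the safer starting point; lower it only if genuine repeats are being missed.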

CodePudding user response：

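Let's try pd.wide_to_long after renaming the df3 columns from the question so that each repeated column has a distinct numeric suffix after a common stub (Area_ch, Mean_ch):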

out = (pd.wide_to_long(df3.rename(columns={'Area_ch1.': 'Area_ch2', 'Mean_ch2.': 'Mean_ch1'}),
                      ['Area_ch', 'Mean_ch'], i='Label', j='test')
       .droplevel(level=-1)
       .reset_index())

print(out)

             Label  Area_ch  Mean_ch
0    E10_1_nucleus      435      313
1  E10_1_cytoplasm      635      847
2    E11_1_nucleus      105      315
3  E11_1_cytoplasm      850      850
4    E10_1_nucleus      135      150
5  E10_1_cytoplasm      605      331
6    E11_1_nucleus      158      195
7  E11_1_cytoplasm      970      130
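
Since the periods were only added to recreate the sample, the real dataframe presumably has exactly repeated column names. In that case the manual rename can be skipped. The following is a minimal sketch that uses plain positional slicing and pd.concat instead of wide_to_long. It assumes that the k-th occurrence of each repeated name belongs to the same block and that there is a single id column. The helper name stack_repeated_columns is made up for this example.

```python
import pandas as pd

# Sample frame with literally duplicated column names (an assumption about the real data).
df = pd.DataFrame(
    [
        ["E10_1_nucleus", 435, 135, 313, 150],
        ["E10_1_cytoplasm", 635, 605, 847, 331],
        ["E11_1_nucleus", 105, 158, 315, 195],
        ["E11_1_cytoplasm", 850, 970, 850, 130],
    ],
    columns=["Label", "Area_ch1", "Area_ch1", "Mean_ch2", "Mean_ch2"],
)


def stack_repeated_columns(df, id_col):
    """Hypothetical helper: stack the k-th occurrences of repeated columns."""
    is_id = df.columns == id_col
    ids = df.loc[:, is_id]
    values = df.loc[:, ~is_id]

    # Number each occurrence of a column name: 0 for the first, 1 for the second, ...
    names = pd.Series(values.columns)
    occurrence = names.groupby(names).cumcount().to_numpy()

    # One block per occurrence number, each with the id column in front.
    blocks = [
        pd.concat([ids, values.iloc[:, occurrence == k]], axis=1)
        for k in range(occurrence.max() + 1)
    ]
    return pd.concat(blocks, ignore_index=True)


out = stack_repeated_columns(df, "Label")
print(out)
```

Columns are selected by position here because selecting a duplicated name by label returns every copy at once. If a name is repeated fewer times than the others, the missing cells in the later blocks come out as NaN.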