Home > Software engineering >  How to summarize many columns into unique combinations
How to summarize many columns into unique combinations

Time:03-21

I have a dataframe with a list of equipment in the rows, and columns for a number of experiments which need the equipment structured as shown:

equipment  exp-1 exp-2 ... exp-n
equip-1    T     F         T
.          .     .         .
equip-n    T     F         F

The equipment names are strings, booleans define if the device is needed for a given experiment. I'm trying to condense all of the different equipment requirements for each experiment into common lists. The end goal being to generate list of equipment that an be used for multiple experiments in this set.

I managed to achieve this by doing the following steps:

  1. Transpose the dataframe and concatenate all the boolean fields together
  2. Get all unique instances of these concatenated values
  3. Iterate through the dataframe using .loc and assign the combined id to each row
  4. Drop duplicates.

The end result looks like this:

experiment equip-1 ... equip-n equip-concat combo
exp-1      T           T       T...T        0
exp-2      F           F       F...F        1
exp-3      T           F       T...F        2

I was then able to transpose this again and stick it back into the original dataframe. So I now have a boolean column for each combination showing if a piece of equipment belongs to it.

equipment  exp-1 exp-2 ... exp-n combo-1 ...combo-n
equip-1    T     F         T     T          F
.          .     .         .
equip-n    T     F         F     F          F

I also needed to get a list of all the experiments that belong in a given combination:

combo 1: exp-1, exp-2, exp4
combo 2: exp-3, exp-5
...

I achieved this by grouping on the combined data and printing it out.

Is there a cleaner way to get the results I need from the initial experiment definitions? My method seems a bit cludgy.

EDIT: The code for the initial steps, one thing is that my initial implementation treated the booleans as strings. A config in this case is the combined set of equipment I'm trying to come up with.

dfexp = df.copy().set_index("equipment").transpose()
dfexpagg = dftests.agg("".join, axis=1)
dfexpagg.name = "Agg Gear"
dfexp["aggregated_gear"] = dfexpagg


unique_configs = dfexp["aggregated_gear"].unique()
for x, config in enumerate(unique_configs):
    dfexp.loc[dfexp["aggregated_gear"] == config, "Config"] = str(x)

dfconfigs = dfexp.set_index("Config"), drop=True).drop("aggregated_gear", axis=1).drop_duplicates().transpose()

I made a highlighted example in excel, hopefully that helps clarify the end goal. The configs are the columns I'm trying to generate. enter image description here

CodePudding user response:

I think you don't need to stick it back to the original dataframe.

The end goal is like you described. So formatting-wise, the closest and clearest is using a dictionary/JSON structure. For example,

# A map of equip_concat to list of experiment names
combo_map = {
    'TFFTFTT...' : ['exp_1', 'exp_4', ...]
     },
    'TFTFTFT...' : ...
}

So just loop over your dataframe and print your dictionary.

combo_map = {}
for index, row in df.iterrows():
    equip_concat = row['equip-concat']
    if equip_concat not in combo_map:
        combo_map[equip_concat] = []
    combo_map[equip_concat].append(row['experiment'])

print(combo_map)

CodePudding user response:

Iterate over columns:

df = pd.DataFrame([['T', 'F', 'T','T'], ['F', 'F', 'T','F'], ['F', 'F','T','F']], index = ['eq1', 'eq2', 'eq3'], columns = ['ex1', 'ex2', 'ex3', 'ex4'])

output = {}
for i, col in enumerate(df.columns):
    temp = df.index[df[col] == 'T'].tolist()
    if temp not in output.values():
        output[f'combo-{i}'] = temp

Output:

{'combo-0': ['eq1'], 'combo-1': [], 'combo-2': ['eq1', 'eq2', 'eq3']}
  • Related