Build multi-indexed dataframe from row pairs-CodePudding

I'm trying to find a good way to build a multi-index dataframe from tuples of paired rows. Here's an example of the pairs.

MultiIndex([(0, 4),
            (1, 4),
            (2, 4),
            (3, 1),
            (3, 2),
            (4, 0),
            (4, 3)],
           )

This question is generalized to any dataframe with multiple rows, but if it helps to visualize here's an example dataframe

          a         b         c         d         e
0  0.335356  0.496102  0.961942  0.624446  0.513594
1  0.775582  0.000616  0.255642  0.209872  0.605989
2  0.379600  0.847778  0.840465  0.321873  0.430256
3  0.221040  0.004871  0.392034  0.474216  0.334206
4  0.064554  0.148257  0.227480  0.083865  0.614093

For example, with the above pairs and dataframe, build a single multi-indexed frame where the first level is the first pair and the second level are all the rows of the dataframe that are paired with the first level.

            0         1         2         3         4
0 4  0.064554  0.148257  0.227480  0.083865  0.614093
1 4  0.064554  0.148257  0.227480  0.083865  0.614093
2 4  0.064554  0.148257  0.227480  0.083865  0.614093
3 1  0.775582  0.000616  0.255642  0.209872  0.605989
  2  0.379600  0.847778  0.840465  0.321873  0.430256
4 0  0.335356  0.496102  0.961942  0.624446  0.513594
  3  0.221040  0.004871  0.392034  0.474216  0.334206

Best I can come up with is building a bunch of multi-indexed subframes and concatenating, but it seems too clunky for what is essentially a new view of the original dataframe.

multi_df = pd.DataFrame()
for lvl_0 in pairs.get_level_values(0).unique():
    lvl_1 = pairs.get_level_values(1)[pairs.get_loc(lvl_0)]
    local_df = pd.DataFrame(data=df.loc[lvl_1].values, index=pd.MultiIndex.from_product([[lvl_0], lvl_1]))
    multi_df = pd.concat((multi_df, local_df), axis=0, join='outer')

By the way, even though there are duplicate rows in this random toy example in my real world situation the level 1 indices have very little duplicates with each other and I'm not really worried about wasted resources. Anyway, I mostly want to use this newly constructed multi-index frame for visualization/manipulation and not for serious number crunching.

Edit: In case it's helpful here's some builder functions for testing.

import pandas as pd
import random

def build_random_df(seed_val=242, nrows=5, ncols=5):
    random.seed(seed_val)
    columns = [chr(i) for i in range(97, 97   ncols)]
    data = [[random.random() for j in range(ncols)] for i in range(nrows)]
    df = pd.DataFrame(data=data, columns=columns)
    return df

def build_random_pairs(seed_val=314, nrows=5):
    random.seed(seed_val)
    row_idx = list(range(nrows))
    fake_matches = []
    for j in range(nrows):
        random.shuffle(row_idx)
        pick_n = random.randint(2, nrows) // 2
        fake_matches.extend(sorted([(j, i) for i in row_idx[:pick_n] if i!=j], key=lambda x: x[1]))
    pairs = pd.MultiIndex.from_tuples(fake_matches)
    return pairs

df = build_random_df()
pairs = build_random_pairs()

CodePudding user response：

Not sure if this is what you have in mind:

Select the dataframe with the last level, and append the first level to final dataframe:

index = pd.MultiIndex.from_tuples([(0, 4),
            (1, 4),
            (2, 4),
            (3, 1),
            (3, 2),
            (4, 0),
            (4, 3)],
           )

a = index.get_level_values(0)
b = index.get_level_values(-1)

(df.loc[b]
   .set_index(a, append = True)
   .swaplevel()
   .set_axis(range(df.columns.size), axis='columns')
)

            0         1         2         3         4
0 4  0.064554  0.148257  0.227480  0.083865  0.614093
1 4  0.064554  0.148257  0.227480  0.083865  0.614093
2 4  0.064554  0.148257  0.227480  0.083865  0.614093
3 1  0.775582  0.000616  0.255642  0.209872  0.605989
  2  0.379600  0.847778  0.840465  0.321873  0.430256
4 0  0.335356  0.496102  0.961942  0.624446  0.513594
  3  0.221040  0.004871  0.392034  0.474216  0.334206