I'm trying to find a good way to build a multi-index dataframe from tuples of paired rows. Here's an example of the pairs.
MultiIndex([(0, 4),
(1, 4),
(2, 4),
(3, 1),
(3, 2),
(4, 0),
(4, 3)],
)
This question is generalized to any dataframe with multiple rows, but if it helps to visualize here's an example dataframe
a b c d e
0 0.335356 0.496102 0.961942 0.624446 0.513594
1 0.775582 0.000616 0.255642 0.209872 0.605989
2 0.379600 0.847778 0.840465 0.321873 0.430256
3 0.221040 0.004871 0.392034 0.474216 0.334206
4 0.064554 0.148257 0.227480 0.083865 0.614093
For example, with the above pairs and dataframe, build a single multi-indexed frame where the first level is the first pair and the second level are all the rows of the dataframe that are paired with the first level.
0 1 2 3 4
0 4 0.064554 0.148257 0.227480 0.083865 0.614093
1 4 0.064554 0.148257 0.227480 0.083865 0.614093
2 4 0.064554 0.148257 0.227480 0.083865 0.614093
3 1 0.775582 0.000616 0.255642 0.209872 0.605989
2 0.379600 0.847778 0.840465 0.321873 0.430256
4 0 0.335356 0.496102 0.961942 0.624446 0.513594
3 0.221040 0.004871 0.392034 0.474216 0.334206
Best I can come up with is building a bunch of multi-indexed subframes and concatenating, but it seems too clunky for what is essentially a new view of the original dataframe.
multi_df = pd.DataFrame()
for lvl_0 in pairs.get_level_values(0).unique():
lvl_1 = pairs.get_level_values(1)[pairs.get_loc(lvl_0)]
local_df = pd.DataFrame(data=df.loc[lvl_1].values, index=pd.MultiIndex.from_product([[lvl_0], lvl_1]))
multi_df = pd.concat((multi_df, local_df), axis=0, join='outer')
By the way, even though there are duplicate rows in this random toy example in my real world situation the level 1 indices have very little duplicates with each other and I'm not really worried about wasted resources. Anyway, I mostly want to use this newly constructed multi-index frame for visualization/manipulation and not for serious number crunching.
Edit: In case it's helpful here's some builder functions for testing.
import pandas as pd
import random
def build_random_df(seed_val=242, nrows=5, ncols=5):
random.seed(seed_val)
columns = [chr(i) for i in range(97, 97 ncols)]
data = [[random.random() for j in range(ncols)] for i in range(nrows)]
df = pd.DataFrame(data=data, columns=columns)
return df
def build_random_pairs(seed_val=314, nrows=5):
random.seed(seed_val)
row_idx = list(range(nrows))
fake_matches = []
for j in range(nrows):
random.shuffle(row_idx)
pick_n = random.randint(2, nrows) // 2
fake_matches.extend(sorted([(j, i) for i in row_idx[:pick_n] if i!=j], key=lambda x: x[1]))
pairs = pd.MultiIndex.from_tuples(fake_matches)
return pairs
df = build_random_df()
pairs = build_random_pairs()
CodePudding user response:
Not sure if this is what you have in mind:
Select the dataframe with the last level, and append the first level to final dataframe:
index = pd.MultiIndex.from_tuples([(0, 4),
(1, 4),
(2, 4),
(3, 1),
(3, 2),
(4, 0),
(4, 3)],
)
a = index.get_level_values(0)
b = index.get_level_values(-1)
(df.loc[b]
.set_index(a, append = True)
.swaplevel()
.set_axis(range(df.columns.size), axis='columns')
)
0 1 2 3 4
0 4 0.064554 0.148257 0.227480 0.083865 0.614093
1 4 0.064554 0.148257 0.227480 0.083865 0.614093
2 4 0.064554 0.148257 0.227480 0.083865 0.614093
3 1 0.775582 0.000616 0.255642 0.209872 0.605989
2 0.379600 0.847778 0.840465 0.321873 0.430256
4 0 0.335356 0.496102 0.961942 0.624446 0.513594
3 0.221040 0.004871 0.392034 0.474216 0.334206