Home > front end >  Manipulating multi-index dataframe with repeating index
Manipulating multi-index dataframe with repeating index

Time:03-03

Let's have such a structure:

from datetime import date, timedelta
from pandas import DataFrame as df
import numpy as np

idx1 = []
idx2 = []
idx3 = []
for i in range(3):
    idx1.append(date.today() - timedelta(days=0))
    idx2.append(date.today() - timedelta(days=1))
    idx3.append(date.today() - timedelta(days=2))

data1 = {"Sasquach": np.random.uniform(1, 10, 2), "Furby":np.random.uniform(1, 10, 2), "Ant":np.random.uniform(1, 10, 2)}
data2 = {"Sasquach": np.random.uniform(1, 10, 2), "Furby":np.random.uniform(1, 10, 2), "Ant":np.random.uniform(1, 10, 2)}
data3 = {"Sasquach": np.random.uniform(1, 10, 2), "Furby":np.random.uniform(1, 10, 2), "Ant":np.random.uniform(1, 10, 2)}

my_dataframe_1 = df.from_dict(data1, orient="index", columns=["pretty", "brave"])
my_dataframe_2 = df.from_dict(data2, orient="index", columns=["pretty", "brave"])
my_dataframe_3 = df.from_dict(data3, orient="index", columns=["pretty", "brave"])
my_dataframe_1["timestamp"] = idx1
my_dataframe_2["timestamp"] = idx2
my_dataframe_3["timestamp"] = idx3

ultimate_df = my_dataframe_1.append(my_dataframe_2.append(my_dataframe_3))
ultimate_df.sort_values(["timestamp", "pretty"], ascending=[True, False], inplace=True)
ultimate_df.reset_index(inplace=True)
ultimate_df.set_index(["timestamp", "timestamp"], inplace=True)

print(ultimate_df)

This gives us:

                          index    pretty     brave
timestamp  timestamp                               
2022-02-28 2022-02-28     Furby  6.083493  8.383633
           2022-02-28  Sasquach  3.454873  6.426673
           2022-02-28       Ant  1.279582  9.647796
2022-03-01 2022-03-01     Furby  9.667125  3.462951
           2022-03-01       Ant  3.443364  5.457242
           2022-03-01  Sasquach  3.364245  5.190403
2022-03-02 2022-03-02       Ant  2.773309  4.708483
           2022-03-02     Furby  2.765552  2.065672
           2022-03-02  Sasquach  2.347767  7.956183

My question is, is there any easy way for manipulating such structured data, where the index repeats for every item in "index" column? My goal is to easily select item from "index" column with the highest pretty value for the most recent date (I was thinking it would be possible with ultimate_df.iloc[-1].iloc[0] but it's not the case. The multi-index is not necessary here, it's the approach I tried using, but when I slice it like this:

print(ultimate_df.iloc[-1])

The result is as below, so it's the row of "inner" index.

index     Sasquach
pretty    2.347767
brave     7.956183
Name: (2022-03-02, 2022-03-02), dtype: object

Do you have any advise how to approach it?. Maybe it's possible to have only that outer-index (that kinda groups, doesn't repeat for all the items) that when doing .iloc[-1] would print the most recent timestamp with those 3 items? I really dislike df.groupby() and I don't want to create separate dataframes for every item in "index" column. My only idea is using such multi-index and slicing like that:

print(ultimate_df.loc[ultimate_df.iloc[-1].name]) (this print just for your convinience)
print(ultimate_df.loc[ultimate_df.iloc[-1].name].iloc[0])

To receive the item with highest "pretty" value of the most recent timestamp/date but it looks very complicated.

Result:

                          index    pretty     brave
timestamp  timestamp                               
2022-03-02 2022-03-02       Ant  2.773309  4.708483
           2022-03-02     Furby  2.765552  2.065672
           2022-03-02  Sasquach  2.347767  7.956183

index          Ant
pretty    2.773309
brave     4.708483
Name: (2022-03-02, 2022-03-02), dtype: object

EDIT: For future generations, some alternative is creating a dict of dataframes to help manipulate such data with repeating items:

data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] *4, 'Ob1' : np.random.rand(16), 'Ob2' : np.random.rand(16)})
#create unique list of names
UniqueNames = data.Names.unique()

#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in UniqueNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]

CodePudding user response:

Trying resetting the MultiIndex and filter to get the required row:

df = ultimate_df.droplevel(0).reset_index()

>>> df.loc[df[df["timestamp"].eq(df["timestamp"].max())]["pretty"].idxmax()]
timestamp    2022-03-02
index             Furby
pretty         8.162101
brave          1.038208
Name: 6, dtype: object
  • Related