Custom Aggregation Across Parallel Hierarchy Levels in a Multi-Index

Time:01-31

I have a dataframe that is organised hierarchically. Consider this:

                                         baseval
indexlevel0 indexlevel1 indexlevel2         
L0-0        L1-0        L2-0               1
                        L2-1               1
                        L2-2              20
            L1-1        L2-0               2
                        L2-1               2
                        L2-2              10

What I need to do is create a new dataframe that collapses the intermediate level (indexlevel1): for each combination of indexlevel0 and indexlevel2, keep a single value that is the minimum of the basevals that appeared under the different indexlevel1 entries.

It is probably easier to just show what I mean. The solution to the above example would be as follows (indexlevel0 and indexlevel2 are preserved, along with the minimum baseval per indexlevel2 group):

                         minbylevel
indexlevel0 indexlevel2            
L0-0        L2-0                  1
            L2-1                  1
            L2-2                 10

I have not the slightest idea where to even start with this. All the aggregation examples I can find work from the bottom level up.

Here's some test code to create the starting point dataframe.

import pandas as pd
from io import StringIO

testdata = """
indexlevel0,indexlevel1,indexlevel2,baseval
L0-0,L1-0,L2-0,1
L0-0,L1-0,L2-1,1
L0-0,L1-0,L2-2,20
L0-0,L1-1,L2-0,2
L0-0,L1-1,L2-1,2
L0-0,L1-1,L2-2,10
"""
testinput = StringIO(testdata)

data_df = pd.read_csv(testinput, index_col=[0,1,2], header=[0]).sort_index()
print(data_df)

CodePudding user response:

If the DataFrame has only one column, it is possible to aggregate the minimum per the first and third levels of the MultiIndex:

df = data_df.groupby(level=[0,2]).min()
print (df)
                         baseval
indexlevel0 indexlevel2         
L0-0        L2-0               1
            L2-1               1
            L2-2              10
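A small variation on the same idea, assuming the index levels are named as in the question: groupby also accepts level names instead of positions, which can read more clearly.

```python
import pandas as pd
from io import StringIO

testdata = """\
indexlevel0,indexlevel1,indexlevel2,baseval
L0-0,L1-0,L2-0,1
L0-0,L1-0,L2-1,1
L0-0,L1-0,L2-2,20
L0-0,L1-1,L2-0,2
L0-0,L1-1,L2-1,2
L0-0,L1-1,L2-2,10
"""
data_df = pd.read_csv(StringIO(testdata), index_col=[0, 1, 2]).sort_index()

# groupby accepts index level names as well as integer positions
df = data_df.groupby(level=['indexlevel0', 'indexlevel2']).min()
print(df)
```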

Or, if there are multiple columns, use DataFrameGroupBy.idxmin with DataFrame.loc and MultiIndex.droplevel:

df = data_df.loc[data_df.groupby(level=[0,2])['baseval'].idxmin()].droplevel(1)
print (df)
                         baseval
indexlevel0 indexlevel2         
L0-0        L2-0               1
            L2-1               1
            L2-2              10
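To see why idxmin matters once there are extra columns, here is a minimal sketch with a hypothetical second column `otherval` (not in the original data): idxmin returns the full index label of the row holding the smallest baseval, and loc then keeps that entire row, so otherval travels with its row instead of being minimised independently.

```python
import pandas as pd
from io import StringIO

# 'otherval' is a made-up second column, added for illustration only
testdata = """\
indexlevel0,indexlevel1,indexlevel2,baseval,otherval
L0-0,L1-0,L2-2,20,100
L0-0,L1-1,L2-2,10,999
"""
data_df = pd.read_csv(StringIO(testdata), index_col=[0, 1, 2]).sort_index()

# idxmin yields one full index tuple per group; loc selects those whole
# rows, and droplevel(1) removes the now-redundant indexlevel1
df = (data_df
      .loc[data_df.groupby(level=[0, 2])['baseval'].idxmin()]
      .droplevel(1))
print(df)
```

The kept row is the one with baseval 10, so otherval is 999 (from the same row), not 100.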

CodePudding user response:

Oh ... it just dawned on me after posting that I could make the problem 'bottom up' simply by swapping levels. After doing that I can treat it as a standard aggregation and just do a groupby -> min over it.

import pandas as pd
from io import StringIO

testdata = """
indexlevel0,indexlevel1,indexlevel2,baseval
L0-0,L1-0,L2-0,1
L0-0,L1-0,L2-1,1
L0-0,L1-0,L2-2,20
L0-0,L1-1,L2-0,2
L0-0,L1-1,L2-1,2
L0-0,L1-1,L2-2,10
"""
testinput = StringIO(testdata)

data_df = pd.read_csv(testinput, index_col=[0,1,2], header=[0]).sort_index()
print(data_df)


# swaplevel swaps the specified level (1) with the last level in the
# index hierarchy. sort_index fixes any out-of-order indices, which
# hurt pandas performance.
s = data_df.swaplevel(1).sort_index().groupby(level=[0,1]).min()
print(s)

output

                                     baseval
indexlevel0 indexlevel1 indexlevel2         
L0-0        L1-0        L2-0               1
                        L2-1               1
                        L2-2              20
            L1-1        L2-0               2
                        L2-1               2
                        L2-2              10
                         baseval
indexlevel0 indexlevel2         
L0-0        L2-0               1
            L2-1               1
            L2-2              10
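Since the intermediate level plays no part in the result, the swap can arguably be skipped altogether by dropping that level first and aggregating over what remains. This is a sketch of that equivalent formulation, assuming the level name from the test data:

```python
import pandas as pd
from io import StringIO

testdata = """\
indexlevel0,indexlevel1,indexlevel2,baseval
L0-0,L1-0,L2-0,1
L0-0,L1-0,L2-1,1
L0-0,L1-0,L2-2,20
L0-0,L1-1,L2-0,2
L0-0,L1-1,L2-1,2
L0-0,L1-1,L2-2,10
"""
data_df = pd.read_csv(StringIO(testdata), index_col=[0, 1, 2]).sort_index()

# drop the intermediate level entirely, then aggregate over the rest;
# duplicate (indexlevel0, indexlevel2) pairs collapse in the groupby
s = data_df.droplevel('indexlevel1').groupby(level=[0, 1]).min()
print(s)
```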

This appears to be relatively fast (thank you to everyone for the alternate answers):

import pandas as pd
from io import StringIO
import timeit

testdata = """
indexlevel0,indexlevel1,indexlevel2,baseval
L0-0,L1-0,L2-0,1
L0-0,L1-0,L2-1,1
L0-0,L1-0,L2-2,20
L0-0,L1-1,L2-0,2
L0-0,L1-1,L2-1,2
L0-0,L1-1,L2-2,10
"""

for i in range(1, 10000):
    testdata += f"""
L0-{i},L1-0,L2-0,1
L0-{i},L1-0,L2-1,1
L0-{i},L1-0,L2-2,20
L0-{i},L1-1,L2-0,2
L0-{i},L1-1,L2-1,2
L0-{i},L1-1,L2-2,10
"""

testinput = StringIO(testdata)

data_df = pd.read_csv(testinput, index_col=[0,1,2], header=[0]).sort_index()
print(data_df)

def test_group():
    s = data_df.groupby(level=[0,2]).min()
    print(s)

def test_swaplevel():
    s = data_df.swaplevel(1).sort_index().groupby(level=[0,1]).min()
    print(s)

def test_loc():
    s = data_df.loc[data_df.groupby(level=[0,2])['baseval'].idxmin()].droplevel(1)
    print(s)

t = timeit.timeit('test_group()', globals = globals(), number=1)
print(f"{t} for test_group")

t = timeit.timeit('test_swaplevel()', globals = globals(), number=1)
print(f"{t} for swaplevel")

t = timeit.timeit('test_loc()', globals = globals(), number=1)
print(f"{t} for loc")
output

0.025066290999999907 for test_group
0.03703514200000013 for swaplevel
10.385111478999999 for loc
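One caveat on the benchmark: each timed function prints its whole result, which adds a constant cost to every variant, and number=1 measures a single run. A print-free harness with repeated runs (a sketch, shown here on a small stand-in frame rather than the ~60,000-row one the original builds) gives a cleaner comparison:

```python
import timeit
import pandas as pd
from io import StringIO

# small stand-in frame; scale up as in the original benchmark
testdata = """\
indexlevel0,indexlevel1,indexlevel2,baseval
L0-0,L1-0,L2-0,1
L0-0,L1-1,L2-2,10
"""
data_df = pd.read_csv(StringIO(testdata), index_col=[0, 1, 2]).sort_index()

def test_group():
    # return instead of print, so only the aggregation itself is timed
    return data_df.groupby(level=[0, 2]).min()

# number=100 amortises one-off costs across repeated runs
t = timeit.timeit(test_group, number=100)
print(f"{t} for test_group (100 runs)")
```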