Home > Net >  Get first instance multi-index and retaining index values
Get first instance multi-index and retaining index values

Time:06-13

I have a df with a multi-index. Both IDX1 and IDX2 have duplicate values ('C' and 'Y' respectively), and I want to retain the first instance of both and keep the index value; df_desired is what I want the result to look like.

I've tried df_tst = df.groupby(level=[0,2]).first() but this drops IDX2 from the df so I have no way of removing the duplicate from IDX1.

# starting data
import pandas as pd
data = \
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65],
['31/01/2021','C','Q',40,45],
['28/02/2021','C','Q',50,55],
['31/03/2021','C','Q',60,65],
['31/01/2021','D','Y',20,15],
['28/02/2021','D','Y',20,15],
['31/03/2021','D','Y',30,30]]

df=pd.DataFrame(data)
df.columns = ['DATE','IDX1','IDX2','VAR1','VAR2']
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index(['IDX1','IDX2','DATE'], inplace=True)

# target df:
df_desired = \
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65]]

CodePudding user response:

Use:

#get first indices per first and second level with remove last level of MultiIndex
df1 = df.droplevel(-1)
idx1 = df1.groupby(level=0).head(1).index
idx2 = df1.groupby(level=1).head(1).index

#compare with intersecton of both levels
df = df[df1.index.isin(idx1.intersection(idx2))]
print (df)
                      VAR1  VAR2
IDX1 IDX2 DATE                  
A    X    2021-01-31    10    15
          2021-02-28    20    30
          2021-03-31    30    45
B    Y    2021-01-31    20    15
          2021-02-28    20    15
          2021-03-31    30    30
C    Z    2021-01-31    40    45
          2021-02-28    50    55
          2021-03-31    60    65

Or:

#test duplicated values by first and second level with helper df1 DataFrame
df1 = df.index.to_frame()
df2 = df.droplevel(-1)
idx = df2.index[~df1.duplicated(['IDX1']) & ~df1.duplicated(['IDX2'])]

df = df[df2.index.isin(idx)]
print (df)
                      VAR1  VAR2
IDX1 IDX2 DATE                  
A    X    2021-01-31    10    15
          2021-02-28    20    30
          2021-03-31    30    45
B    Y    2021-01-31    20    15
          2021-02-28    20    15
          2021-03-31    30    30
C    Z    2021-01-31    40    45
          2021-02-28    50    55
          2021-03-31    60    65
  • Related