I have a df with a multi-index. Both IDX1 and IDX2 have duplicate values ('C' and 'Y' respectively), and I want to retain the first instance of both and keep the index value; df_desired is what I want the result to look like.
I've tried df_tst = df.groupby(level=[0,2]).first()
but this drops IDX2 from the df so I have no way of removing the duplicate from IDX1.
# starting data
import pandas as pd
data = \
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65],
['31/01/2021','C','Q',40,45],
['28/02/2021','C','Q',50,55],
['31/03/2021','C','Q',60,65],
['31/01/2021','D','Y',20,15],
['28/02/2021','D','Y',20,15],
['31/03/2021','D','Y',30,30]]
df=pd.DataFrame(data)
df.columns = ['DATE','IDX1','IDX2','VAR1','VAR2']
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index(['IDX1','IDX2','DATE'], inplace=True)
# target df:
df_desired = \
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65]]
CodePudding user response:
Use:
#get first indices per first and second level with remove last level of MultiIndex
df1 = df.droplevel(-1)
idx1 = df1.groupby(level=0).head(1).index
idx2 = df1.groupby(level=1).head(1).index
#compare with intersecton of both levels
df = df[df1.index.isin(idx1.intersection(idx2))]
print (df)
VAR1 VAR2
IDX1 IDX2 DATE
A X 2021-01-31 10 15
2021-02-28 20 30
2021-03-31 30 45
B Y 2021-01-31 20 15
2021-02-28 20 15
2021-03-31 30 30
C Z 2021-01-31 40 45
2021-02-28 50 55
2021-03-31 60 65
Or:
#test duplicated values by first and second level with helper df1 DataFrame
df1 = df.index.to_frame()
df2 = df.droplevel(-1)
idx = df2.index[~df1.duplicated(['IDX1']) & ~df1.duplicated(['IDX2'])]
df = df[df2.index.isin(idx)]
print (df)
VAR1 VAR2
IDX1 IDX2 DATE
A X 2021-01-31 10 15
2021-02-28 20 30
2021-03-31 30 45
B Y 2021-01-31 20 15
2021-02-28 20 15
2021-03-31 30 30
C Z 2021-01-31 40 45
2021-02-28 50 55
2021-03-31 60 65