Context: I have a dataframe that contains duplicated indices, howerver, I'd only like to drop the duplicated indices that contain a specific pattern
For example if we have this dataframe:
tt = pd.DataFrame({'A':[1,2,3,10,10,2,3,4,20,20]}, index=['Sum_2019','X','Y','Sum_2020','Sum_2020','A','B','C','Sum_2021','Sum_2021'])
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
Sum_2020 10
X 2
Y 3
Z 4
Sum_2021 20
Sum_2021 20
Desired output: How can I drop only the indices that contain the pattern "Sum_" or each given year (2020 repeated) from the dataframe?
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
X 2
Y 3
C 4
Sum_2021 20
Attempts:
I was trying to do this:
df= df[~df.index.duplicated(keep='first')]
But this also removes the indices "X" and "Y" that I want to keep.
Thank you!
CodePudding user response:
Add condition for keep indices not starting by Sum_
and chain by &
for bitwise AND
:
df = pd.DataFrame({'A':[1,2,3,10,10,2,3,4,20,20]},
index=['Sum_2019','X','Y','Sum_2020','Sum_2020',
'X','Y','Z','Sum_2021','Sum_2021'])
df = df[~df.index.duplicated(keep='first') | ~df.index.str.startswith('Sum_')]
print (df)
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
X 2
Y 3
Z 4
Sum_2021 20
CodePudding user response:
You can use shift
to compare successive label:
# identify "Sum_" indices
s = tt.index.to_series().str.contains('Sum_')
# keep those that are not Sum_ if the previous is Sum_
out = tt[~(s&s.shift())]
output:
A
Sum_2019 1
X 2
Y 3
Sum_2020 10
A 2
B 3
C 4
Sum_2021 20