Drop duplicated indices/columns that only contain a sub-string from a dataframe-CodePudding

Context: I have a dataframe that contains duplicated indices, howerver, I'd only like to drop the duplicated indices that contain a specific pattern

For example if we have this dataframe:

tt = pd.DataFrame({'A':[1,2,3,10,10,2,3,4,20,20]}, index=['Sum_2019','X','Y','Sum_2020','Sum_2020','A','B','C','Sum_2021','Sum_2021'])

           A
Sum_2019   1
X          2
Y          3
Sum_2020  10
Sum_2020  10
X          2
Y          3
Z          4
Sum_2021  20
Sum_2021  20

Desired output: How can I drop only the indices that contain the pattern "Sum_" or each given year (2020 repeated) from the dataframe?

           A
Sum_2019   1
X          2
Y          3
Sum_2020  10
X          2
Y          3
C          4
Sum_2021  20

Attempts:

I was trying to do this:

        df= df[~df.index.duplicated(keep='first')]

But this also removes the indices "X" and "Y" that I want to keep.

Thank you!

CodePudding user response：

Add condition for keep indices not starting by Sum_ and chain by & for bitwise AND:

df = pd.DataFrame({'A':[1,2,3,10,10,2,3,4,20,20]},
                  index=['Sum_2019','X','Y','Sum_2020','Sum_2020',
                          'X','Y','Z','Sum_2021','Sum_2021'])


df = df[~df.index.duplicated(keep='first') | ~df.index.str.startswith('Sum_')]
print (df)
           A
Sum_2019   1
X          2
Y          3
Sum_2020  10
X          2
Y          3
Z          4
Sum_2021  20

CodePudding user response：

You can use shift to compare successive label:

# identify "Sum_" indices
s = tt.index.to_series().str.contains('Sum_')

# keep those that are not Sum_ if the previous is Sum_
out = tt[~(s&s.shift())]

output:

           A
Sum_2019   1
X          2
Y          3
Sum_2020  10
A          2
B          3
C          4
Sum_2021  20