Say I have a pandas DataFrame with a MultiIndex:
import numpy as np
import pandas as pd

arrays = [["UK", "UK", "US", "FR"], ["Firm1", "Firm1", "Firm2", "Firm1"], ["Andy", "Peter", "Peter", "Andy"]]
idx = pd.MultiIndex.from_arrays(arrays, names=("Country", "Firm", "Responsible"))
df = pd.DataFrame(np.zeros(4), index=idx)
df
                             0
Country Firm  Responsible
UK      Firm1 Andy         0.0
              Peter        0.0
US      Firm2 Peter        0.0
FR      Firm1 Andy         0.0
I want to drop duplicated entries of the first two index levels (in the example, the rows with "UK" and "Firm1") and keep only the rows where the third index level, "Responsible", equals "Andy". So in this case I want to drop the second row.
Pandas has drop_duplicates(), but I don't see how I can i) apply it only to the first two index levels and ii) specify that the rows with "Andy" are kept and the rest dropped (the keep argument only accepts 'first', 'last' or False).
I would be happy for any input! Many thanks in advance.
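The closest I can get with drop_duplicates is the sketch below (continuing from the setup above), but keep='first' / keep='last' select purely by row position, so it keeps "Andy" here only by accident of the row ordering:
# Naive attempt: move the index levels into columns and deduplicate on the
# first two; 'first'/'last' pick by position, not by the value of Responsible
df.reset_index().drop_duplicates(subset=["Country", "Firm"], keep="first")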
CodePudding user response:
Remove rows that are duplicated on the first two levels and do not have Andy in the Responsible level: first convert the index to a DataFrame with Index.to_frame, test duplicates on the first two levels with DataFrame.duplicated and keep=False (so all duplicates are marked), then chain another mask that keeps the Andy rows, combined with | for bitwise OR:
# Index levels as a regular DataFrame, so DataFrame.duplicated can be used
df1 = df.index.to_frame()

# Keep rows that are not duplicated on the first two levels, or that belong to Andy
df = df[~df1.duplicated(subset=['Country', 'Firm'], keep=False) |
        df1['Responsible'].eq('Andy')]
print(df)
                             0
Country Firm  Responsible
UK      Firm1 Andy         0.0
US      Firm2 Peter        0.0
FR      Firm1 Andy         0.0
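For clarity, the boolean mask above is equivalent to this step-by-step sketch (the names dupes, is_andy and keep are just illustrative):
# Equivalent step-by-step form of the mask used above
dupes = df1.duplicated(subset=['Country', 'Firm'], keep=False)   # (Country, Firm) occurs more than once
is_andy = df1['Responsible'].eq('Andy')                          # Responsible is Andy
keep = ~dupes | is_andy
print(keep.tolist())   # [True, False, True, True] -> only the UK/Firm1/Peter row is dropped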
CodePudding user response:
One simple option: drop the last level and use Index.duplicated, then combine it with any other mask you want:
df[~df.index.droplevel(2).duplicated()]
# or by name:
# df[~df.index.droplevel('Responsible').duplicated()]
output:
                             0
Country Firm  Responsible
UK      Firm1 Andy         0.0
US      Firm2 Peter        0.0
FR      Firm1 Andy         0.0
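Since Index.duplicated with the default keep='first' only keeps the positionally first occurrence, here is a sketch of combining it with the "Andy" condition as suggested above (dup and andy are illustrative names; keep=False marks every occurrence of a duplicated pair):
# Keep rows that are not duplicated on (Country, Firm), or whose Responsible is 'Andy'
dup = df.index.droplevel('Responsible').duplicated(keep=False)
andy = df.index.get_level_values('Responsible') == 'Andy'
df[~dup | andy]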