I have a 2-level index DataFrame and I'd like to leave only 2 top rows for second level and add a row "Other" for the rest of rows. This is my DataFrame:
import pandas as pd
df = pd.DataFrame()
df["Idx1"] = ["A", "A", "A", "A", "B", "B", "B", "B"]
df["Idx2"] = ["X", "Y", "Z", "Q", "X", "Y", "Z", "Q"]
df["Values"] = [1,2,3,4, 1,2,3,4]
df = df.set_index(["Idx1", "Idx2"])
df
and this is how I can extract top 2
rows for first level:
res = df.groupby(level = 0).head(2)
res
However I have problem with adding "Other" column. My desired output would be:
How to do it in the most elegant way?
CodePudding user response:
I think you can convert second level of MultiIndex by Other
by counter by GroupBy.cumcount
and then aggregate sum
:
idx = df.index.get_level_values(1).where(df.groupby(level = 0).cumcount().lt(2), 'Other')
df = df.groupby(['Idx1',idx], sort=False).sum()
print (df)
Values
Idx1 Idx2
A X 1
Y 2
Other 7
B X 1
Y 2
Other 7
Performance for 1k rows in 10k rows of data:
np.random.seed(2023)
N = 10000
L = list('abcdefghijklmno')
df = pd.DataFrame({'Idx1': np.random.randint(1000, size=N),
'Idx2': np.random.choice(L, size=N),
'Values':np.random.randint(1000, size=N)}).set_index(["Idx1", "Idx2"]).sort_index()
print (df)
In [230]: %%timeit
...: idx = df.index.get_level_values(1).where(df.groupby(level = 0).cumcount().lt(2), 'Other')
...:
...: df.groupby(['Idx1',idx], sort=False).sum()
...:
5.62 ms ± 76.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another solution from mozway
is 382 times slowier:
In [231]: %%timeit
...: (df.groupby(level='Idx1')
...: .apply(lambda d: pd.concat([d.droplevel(0).head(2),
...: d.droplevel(0).tail(-2).sum().to_frame(name='Other').T,
...: ]).rename_axis('Idx2'))
...: )
...:
2.15 s ± 139 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
One option using groupby.apply
to get and concat
the first two values with head
, and the sum
of the rest:
(df.groupby(level='Idx1')
.apply(lambda d: pd.concat([d.droplevel(0).head(2),
d.droplevel(0).tail(-2).sum().to_frame(name='Other').T,
]).rename_axis('Idx2'))
)
Output:
Values
Idx1 Idx2
A X 1
Y 2
Other 7
B X 1
Y 2
Other 7