I have a dataframe which looks like this (see table). For simplicity's sake, "aapl" is the only ticker shown; the real dataframe has more tickers.
ticker | year | return
---|---|---
aapl | 1999 | 1
aapl | 2000 | 3
aapl | 2000 | 2
What I'd like to do is first group the dataframe by ticker, then by year. Next, I'd like to remove any duplicate years. In the end the dataframe should look like this:
ticker | year | return
---|---|---
aapl | 1999 | 1
aapl | 2000 | 3
I have a working solution, but it's not very "Pandas-esque" and involves for loops. I'm semi-certain that if I come back to the solution in three months, it'll be completely foreign to me.
Right now, I've been working on the following, with little luck:
df = df.groupby('ticker').groupby('year').drop_duplicates(subset=['year'])
This, however, produces the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
Any help here would be greatly appreciated, thanks.
CodePudding user response:
@QuangHoang provided the simplest version in the comments:
df.drop_duplicates(['ticker', 'year'])
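For reference, here is that one-liner run on the sample data from the question (the frame is reconstructed from the table above):

```python
import pandas as pd

# Reconstruct the sample frame from the question
df = pd.DataFrame({
    "ticker": ["aapl", "aapl", "aapl"],
    "year": [1999, 2000, 2000],
    "return": [1, 3, 2],
})

# drop_duplicates keeps the first row for each (ticker, year) pair,
# so the second (aapl, 2000) row is dropped
deduped = df.drop_duplicates(["ticker", "year"])
print(deduped)
#   ticker  year  return
# 0   aapl  1999       1
# 1   aapl  2000       3
```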
Alternatively, you can use .groupby twice, inside two .applys:
df.groupby("ticker", group_keys=False).apply(lambda x:
x.groupby("year", group_keys=False).apply(lambda x: x.drop_duplicates(['year']))
)
Alternatively, you can use the .duplicated function:
df.groupby('ticker', group_keys=False).apply(lambda x:
    x[~x['year'].duplicated(keep='first')]
)
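A quick sketch of how the mask-based version behaves on the question's sample data: within each ticker group, .duplicated() flags every repeat of a year after its first appearance, and the ~ inverts that mask to keep only the first occurrences.

```python
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({
    "ticker": ["aapl", "aapl", "aapl"],
    "year": [1999, 2000, 2000],
    "return": [1, 3, 2],
})

# Per ticker group, keep only rows whose year has not been seen before
result = df.groupby("ticker", group_keys=False).apply(
    lambda g: g[~g["year"].duplicated(keep="first")]
)
```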
CodePudding user response:
You can try to sort the values first and then groupby.tail:
df.sort_values('return').groupby(['ticker','year']).tail(1)
ticker year return
0 aapl 1999 1
1 aapl 2000 3
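One point worth noting about this approach: because the frame is sorted by 'return' first, tail(1) keeps the highest-return row in each (ticker, year) group, which happens to match the desired output here. A minimal sketch on the sample data:

```python
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({
    "ticker": ["aapl", "aapl", "aapl"],
    "year": [1999, 2000, 2000],
    "return": [1, 3, 2],
})

# After sorting by 'return', tail(1) retains the row with the
# largest return within each (ticker, year) group
result = df.sort_values("return").groupby(["ticker", "year"]).tail(1)
```

If you instead want the first-seen row per group, skip the sort (or use head(1) on the unsorted frame).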
CodePudding user response:
I'm almost sure you want to do this:
df.drop_duplicates(subset=["ticker","year"])