I'm trying to filter a pandas dataframe so that I'm able to get the most recent data point for each account number in the dataframe. Here is an example of what the data looks like. I'm looking for an output of one instance of an account with the product and most recent date.
account_number product sale_date
0 123 rental 2021-12-01
1 423 rental 2021-10-01
2 513 sale 2021-11-02
3 123 sale 2022-01-01
4 513 sale 2021-11-30
I was trying to use groupby
and idxmax()
but it doesn't work with dates.
And I did want to change the dtype away from date time.
data_grouped = data.groupby('account_number')['sale_date'].max().idxmax()
Any ideas would be awesome.
CodePudding user response:
It seems the sale_date
column has strings. If you convert it to datetime dtype, then you can use groupby
idxmax
:
df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]
Output:
account_number product sale_date
3 123 sale 2022-01-01
1 423 rental 2021-10-01
4 513 sale 2021-11-30
CodePudding user response:
Would the keyword 'first' work ? So that would be:
data.groupby('account_number')['sale_date'].first()
CodePudding user response:
To retain a subsetted data frame, consider sorting by account number and descending sale date, then call DataFrame.groupby().head
(which will return NaN
s if in first row per group unlike DataFrame.groupby().first
):
data_grouped = (
data.sort_values(
["account_number", "sale_date"], ascending=[True, False]
).reset_index(drop=True)
.groupby("account_number")
.head(1)
)
CodePudding user response:
You want the last
keyword in order to get the most recent date after grouping, like this:
df.groupby(by=["account_number"])["sale_date"].last()
which will provide this output:
account_number
123 2022-01-01
423 2021-10-01
513 2021-11-30
Name: sale_date, dtype: datetime64[ns]
It is unclear why you want to transition away from using the datetime dtype, but you need it in order to correctly sort for the value you are looking for. Consider doing this as an intermediate step, then reformatting the column after processing.
CodePudding user response:
I'll change my answer to use @Daniel Weigelbut's answer... and also here, where you can apply .nth(n)
to find the nth value for a general case ((-1) for the most recent date).
new_data = data.groupby('account_number')['sale_date'].nth(-1)
My previous suggestion of creating a sorted multi index with
data.set_index(['account_number', 'sale_date'], inplace = True)
data_sorted = data.sort_index(level = [0, 1])
still works and might be more useful for any more complex sorting. As others have said, make sure your date strings are date time objects if you sort like this.