Filter for most recent event by group with pandas-CodePudding

I'm trying to filter a pandas dataframe so that I'm able to get the most recent data point for each account number in the dataframe. Here is an example of what the data looks like. I'm looking for an output of one instance of an account with the product and most recent date.

account_number product   sale_date
0             123  rental  2021-12-01
1             423  rental  2021-10-01
2             513    sale  2021-11-02
3             123    sale  2022-01-01
4             513    sale  2021-11-30

I was trying to use groupby and idxmax() but it doesn't work with dates.

And I did want to change the dtype away from date time.

data_grouped = data.groupby('account_number')['sale_date'].max().idxmax()

Any ideas would be awesome.

CodePudding user response：

It seems the sale_date column has strings. If you convert it to datetime dtype, then you can use groupby idxmax:

df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]

Output:

   account_number product  sale_date
3             123    sale 2022-01-01
1             423  rental 2021-10-01
4             513    sale 2021-11-30

CodePudding user response：

Would the keyword 'first' work ? So that would be:

data.groupby('account_number')['sale_date'].first()

CodePudding user response：

To retain a subsetted data frame, consider sorting by account number and descending sale date, then call DataFrame.groupby().head (which will return NaNs if in first row per group unlike DataFrame.groupby().first):

data_grouped = (
    data.sort_values(
         ["account_number", "sale_date"], ascending=[True, False]
        ).reset_index(drop=True)
        .groupby("account_number")
        .head(1)
)

CodePudding user response：

You want the last keyword in order to get the most recent date after grouping, like this:

df.groupby(by=["account_number"])["sale_date"].last()

which will provide this output:

account_number
123   2022-01-01
423   2021-10-01
513   2021-11-30
Name: sale_date, dtype: datetime64[ns]

It is unclear why you want to transition away from using the datetime dtype, but you need it in order to correctly sort for the value you are looking for. Consider doing this as an intermediate step, then reformatting the column after processing.

CodePudding user response：

I'll change my answer to use @Daniel Weigelbut's answer... and also here, where you can apply .nth(n) to find the nth value for a general case ((-1) for the most recent date).

new_data = data.groupby('account_number')['sale_date'].nth(-1)

My previous suggestion of creating a sorted multi index with

data.set_index(['account_number', 'sale_date'], inplace = True)
data_sorted = data.sort_index(level = [0, 1])

still works and might be more useful for any more complex sorting. As others have said, make sure your date strings are date time objects if you sort like this.