Pandas: Display One Row per Column Value based on Date-CodePudding

Suppose I have a dataframe df which prints as follows:

    Order                        Report Name        Last Updated
6       1                 Fund Balance Sheet  Sep 11 10:36:28 AM
15      1                 Fund Balance Sheet  Sep 08 11:07:22 AM
14      2  Fund Income Statement - Operating  Sep 11 10:39:39 AM
3       2  Fund Income Statement - Operating  Sep 08 10:55:17 AM
5       3    Fund Income Statement - Reserve  Sep 11 10:49:38 AM
0       3    Fund Income Statement - Reserve  Sep 08 11:09:14 AM
4       4                     General Ledger  Sep 11 10:40:26 AM
13      4                     General Ledger  Sep 08 11:10:13 AM
12      5                   Expense Register  Sep 11 11:03:21 AM
8       5                   Expense Register  Sep 08 10:50:13 AM
7       6              Aged Payables Summary  Sep 11 10:42:19 AM
11      6              Aged Payables Summary  Sep 08 11:09:11 AM
2       7       Homeowner Prepayment Balance  Sep 11 10:36:01 AM
10      7       Homeowner Prepayment Balance  Sep 08 11:02:10 AM
1       8              Homeowner Delinquency  Sep 11 10:40:20 AM
9       8              Homeowner Delinquency  Sep 08 10:58:13 AM

How do I go about outputting a new df, still ordered by the order column, which only contains rows for the most recently updated report by each name? For example, with the above input, I'd like the output to be:

    Order                        Report Name        Last Updated
6       1                 Fund Balance Sheet  Sep 11 10:36:28 AM
14      2  Fund Income Statement - Operating  Sep 11 10:39:39 AM
5       3    Fund Income Statement - Reserve  Sep 11 10:49:38 AM
4       4                     General Ledger  Sep 11 10:40:26 AM
12      5                   Expense Register  Sep 11 11:03:21 AM
7       6              Aged Payables Summary  Sep 11 10:42:19 AM
2       7       Homeowner Prepayment Balance  Sep 11 10:36:01 AM
1       8              Homeowner Delinquency  Sep 11 10:40:20 AM

Notice that all the 9/11 reports were kept, and the 9/8 reports removed. Is there any way to do this other than with tons of looping and max type functions?

CodePudding user response：

There's no year on there, so I'm assuming the years are all the same?

Sort the dataframe by the 'Last Updated'. Then drop_duplicates and keep the last.

df = df.sort_values('Last Updated')
df = df.drop_duplicates(subset = 'Report Name', keep='last')

Assuming years are the same:

import pandas as pd


cols = ['Order','Report Name','Last Updated']

data =  [   
[1,                 'Fund Balance Sheet',  'Sep 11 10:36:28 AM'],
[1,                 'Fund Balance Sheet',  'Sep 08 11:07:22 AM'],
[2,  'Fund Income Statement - Operating',  'Sep 11 10:39:39 AM'],
[2,  'Fund Income Statement - Operating',  'Sep 08 10:55:17 AM'],
[3,    'Fund Income Statement - Reserve',  'Sep 11 10:49:38 AM'],
[3,    'Fund Income Statement - Reserve',  'Sep 08 11:09:14 AM'],
[4,                     'General Ledger',  'Sep 11 10:40:26 AM'],
[4,                     'General Ledger',  'Sep 08 11:10:13 AM'],
[5,                   'Expense Register',  'Sep 11 11:03:21 AM'],
[5,                   'Expense Register',  'Sep 08 10:50:13 AM'],
[6,              'Aged Payables Summary',  'Sep 11 10:42:19 AM'],
[6,              'Aged Payables Summary',  'Sep 08 11:09:11 AM'],
[7,       'Homeowner Prepayment Balance',  'Sep 11 10:36:01 AM'],
[7,       'Homeowner Prepayment Balance',  'Sep 08 11:02:10 AM'],
[8,              'Homeowner Delinquency',  'Sep 11 10:40:20 AM'],
[8,              'Homeowner Delinquency',  'Sep 08 10:58:13 AM']]

df = pd.DataFrame(data, columns=cols)
df['Last Updated'] = pd.to_datetime(df['Last Updated'], format="%b %d %I:%M:%S %p")

df = df.sort_values('Last Updated')
df = df.drop_duplicates(subset = 'Report Name', keep='last')
df['Last Updated'] = df['Last Updated'].dt.strftime('%b %d %I:%M:%S %p')

Output:

print(df)
    Order                        Report Name        Last Updated
12      7       Homeowner Prepayment Balance  Sep 11 10:36:01 AM
0       1                 Fund Balance Sheet  Sep 11 10:36:28 AM
2       2  Fund Income Statement - Operating  Sep 11 10:39:39 AM
14      8              Homeowner Delinquency  Sep 11 10:40:20 AM
6       4                     General Ledger  Sep 11 10:40:26 AM
10      6              Aged Payables Summary  Sep 11 10:42:19 AM
4       3    Fund Income Statement - Reserve  Sep 11 10:49:38 AM
8       5                   Expense Register  Sep 11 11:03:21 AM