Suppose I have a dataframe df
which prints as follows:
Order Report Name Last Updated
6 1 Fund Balance Sheet Sep 11 10:36:28 AM
15 1 Fund Balance Sheet Sep 08 11:07:22 AM
14 2 Fund Income Statement - Operating Sep 11 10:39:39 AM
3 2 Fund Income Statement - Operating Sep 08 10:55:17 AM
5 3 Fund Income Statement - Reserve Sep 11 10:49:38 AM
0 3 Fund Income Statement - Reserve Sep 08 11:09:14 AM
4 4 General Ledger Sep 11 10:40:26 AM
13 4 General Ledger Sep 08 11:10:13 AM
12 5 Expense Register Sep 11 11:03:21 AM
8 5 Expense Register Sep 08 10:50:13 AM
7 6 Aged Payables Summary Sep 11 10:42:19 AM
11 6 Aged Payables Summary Sep 08 11:09:11 AM
2 7 Homeowner Prepayment Balance Sep 11 10:36:01 AM
10 7 Homeowner Prepayment Balance Sep 08 11:02:10 AM
1 8 Homeowner Delinquency Sep 11 10:40:20 AM
9 8 Homeowner Delinquency Sep 08 10:58:13 AM
How do I go about outputting a new df, still ordered by the order
column, which only contains rows for the most recently updated report by each name? For example, with the above input, I'd like the output to be:
Order Report Name Last Updated
6 1 Fund Balance Sheet Sep 11 10:36:28 AM
14 2 Fund Income Statement - Operating Sep 11 10:39:39 AM
5 3 Fund Income Statement - Reserve Sep 11 10:49:38 AM
4 4 General Ledger Sep 11 10:40:26 AM
12 5 Expense Register Sep 11 11:03:21 AM
7 6 Aged Payables Summary Sep 11 10:42:19 AM
2 7 Homeowner Prepayment Balance Sep 11 10:36:01 AM
1 8 Homeowner Delinquency Sep 11 10:40:20 AM
Notice that all the 9/11 reports were kept, and the 9/8 reports removed. Is there any way to do this other than with tons of looping and max
type functions?
CodePudding user response:
There's no year on there, so I'm assuming the years are all the same?
Sort the dataframe by the 'Last Updated'
. Then drop_duplicates
and keep the last.
df = df.sort_values('Last Updated')
df = df.drop_duplicates(subset = 'Report Name', keep='last')
Assuming years are the same:
import pandas as pd
cols = ['Order','Report Name','Last Updated']
data = [
[1, 'Fund Balance Sheet', 'Sep 11 10:36:28 AM'],
[1, 'Fund Balance Sheet', 'Sep 08 11:07:22 AM'],
[2, 'Fund Income Statement - Operating', 'Sep 11 10:39:39 AM'],
[2, 'Fund Income Statement - Operating', 'Sep 08 10:55:17 AM'],
[3, 'Fund Income Statement - Reserve', 'Sep 11 10:49:38 AM'],
[3, 'Fund Income Statement - Reserve', 'Sep 08 11:09:14 AM'],
[4, 'General Ledger', 'Sep 11 10:40:26 AM'],
[4, 'General Ledger', 'Sep 08 11:10:13 AM'],
[5, 'Expense Register', 'Sep 11 11:03:21 AM'],
[5, 'Expense Register', 'Sep 08 10:50:13 AM'],
[6, 'Aged Payables Summary', 'Sep 11 10:42:19 AM'],
[6, 'Aged Payables Summary', 'Sep 08 11:09:11 AM'],
[7, 'Homeowner Prepayment Balance', 'Sep 11 10:36:01 AM'],
[7, 'Homeowner Prepayment Balance', 'Sep 08 11:02:10 AM'],
[8, 'Homeowner Delinquency', 'Sep 11 10:40:20 AM'],
[8, 'Homeowner Delinquency', 'Sep 08 10:58:13 AM']]
df = pd.DataFrame(data, columns=cols)
df['Last Updated'] = pd.to_datetime(df['Last Updated'], format="%b %d %I:%M:%S %p")
df = df.sort_values('Last Updated')
df = df.drop_duplicates(subset = 'Report Name', keep='last')
df['Last Updated'] = df['Last Updated'].dt.strftime('%b %d %I:%M:%S %p')
Output:
print(df)
Order Report Name Last Updated
12 7 Homeowner Prepayment Balance Sep 11 10:36:01 AM
0 1 Fund Balance Sheet Sep 11 10:36:28 AM
2 2 Fund Income Statement - Operating Sep 11 10:39:39 AM
14 8 Homeowner Delinquency Sep 11 10:40:20 AM
6 4 General Ledger Sep 11 10:40:26 AM
10 6 Aged Payables Summary Sep 11 10:42:19 AM
4 3 Fund Income Statement - Reserve Sep 11 10:49:38 AM
8 5 Expense Register Sep 11 11:03:21 AM