R Dataframe filtering: Using unique or duplicate function based on time factor-CodePudding

I'm trying to filter a dataframe of loan data, but each monthly report duplicates loans if they're still outstanding, or drops loans if paid (can't just use latest monthly report). I'd like to filter through lenders for unique Date of Maturity for loans and remove duplicates and keeping only the most recent data by report date. Here's an example of the data:

df <- data.frame(Reporting.date=c("6/30/2020","6/30/2020","6/30/2020","8/31/2021","8/31/2021"
                                  ,"8/31/2021","6/30/2020","7/31/2021","5/31/2020","12/31/2020")
                 , Lender.name=c("Lender1","Lender1","Lender1","Lender1","Lender1","Lender1"
                                 ,"Lender1","Lender1","Lender2","Lender2")
                 , Date.of.maturity=c("6/20/2025","6/20/2025","6/20/2025","6/20/2025","6/20/2025"
                                      ,"6/20/2025","6/30/2022","6/30/2022","5/15/2024","5/15/2024")
                 , Loan.amount=c(13129474,14643881,44935677,13129474,14643881,44935677
                                 ,150000,150000,2750000,2750000))

As you can see from the example data Lender1 has 2 unique maturity dates. The first date has 3 loans that get duplicated across 2 reporting dates, and the second maturity date has 1 loan being duplicated. I want to remove duplicates to keep the newest report data. My hope is to get a dataframe that looks like this afterwards:

Reporting.date	Lender.name	Date.of.maturity	Loan.amount
8/31/2021	Lender1	6/20/2025	13129474
8/31/2021	Lender1	6/20/2025	14643881
8/31/2021	Lender1	6/20/2025	44935677
7/31/2021	Lender1	6/30/2022	150000
12/31/2020	Lender2	5/15/2024	2750000

CodePudding user response：

You'll need to convert Reporting.date to Date format, either in a mutate (like I did) or directly in the filter.

library(tidyverse)

df %>%
  mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  filter(Reporting.date == max(Reporting.date)) %>%
  ungroup()

CodePudding user response：

We may also do this with arrange

library(dplyr)
library(lubridate)
df %>%
  arrange(Lender.name, Date.of.maturity, Loan.amount, 
         desc(mdy(Reporting.date))) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  slice_head(n = 1) %>%
  ungroup

-output

# A tibble: 5 x 4
  Reporting.date Lender.name Date.of.maturity Loan.amount
  <chr>          <chr>       <chr>                  <dbl>
1 8/31/2021      Lender1     6/20/2025           13129474
2 8/31/2021      Lender1     6/20/2025           14643881
3 8/31/2021      Lender1     6/20/2025           44935677
4 7/31/2021      Lender1     6/30/2022             150000
5 12/31/2020     Lender2     5/15/2024            2750000