I'm trying to filter a dataframe of loan data, but each monthly report duplicates loans if they're still outstanding, or drops loans if paid (can't just use latest monthly report). I'd like to filter through lenders for unique Date of Maturity for loans and remove duplicates and keeping only the most recent data by report date. Here's an example of the data:
df <- data.frame(Reporting.date=c("6/30/2020","6/30/2020","6/30/2020","8/31/2021","8/31/2021"
,"8/31/2021","6/30/2020","7/31/2021","5/31/2020","12/31/2020")
, Lender.name=c("Lender1","Lender1","Lender1","Lender1","Lender1","Lender1"
,"Lender1","Lender1","Lender2","Lender2")
, Date.of.maturity=c("6/20/2025","6/20/2025","6/20/2025","6/20/2025","6/20/2025"
,"6/20/2025","6/30/2022","6/30/2022","5/15/2024","5/15/2024")
, Loan.amount=c(13129474,14643881,44935677,13129474,14643881,44935677
,150000,150000,2750000,2750000))
As you can see from the example data Lender1 has 2 unique maturity dates. The first date has 3 loans that get duplicated across 2 reporting dates, and the second maturity date has 1 loan being duplicated. I want to remove duplicates to keep the newest report data. My hope is to get a dataframe that looks like this afterwards:
Reporting.date | Lender.name | Date.of.maturity | Loan.amount |
---|---|---|---|
8/31/2021 | Lender1 | 6/20/2025 | 13129474 |
8/31/2021 | Lender1 | 6/20/2025 | 14643881 |
8/31/2021 | Lender1 | 6/20/2025 | 44935677 |
7/31/2021 | Lender1 | 6/30/2022 | 150000 |
12/31/2020 | Lender2 | 5/15/2024 | 2750000 |
CodePudding user response:
You'll need to convert Reporting.date
to Date format, either in a mutate
(like I did) or directly in the filter
.
library(tidyverse)
df %>%
mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
filter(Reporting.date == max(Reporting.date)) %>%
ungroup()
CodePudding user response:
We may also do this with arrange
library(dplyr)
library(lubridate)
df %>%
arrange(Lender.name, Date.of.maturity, Loan.amount,
desc(mdy(Reporting.date))) %>%
group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
slice_head(n = 1) %>%
ungroup
-output
# A tibble: 5 x 4
Reporting.date Lender.name Date.of.maturity Loan.amount
<chr> <chr> <chr> <dbl>
1 8/31/2021 Lender1 6/20/2025 13129474
2 8/31/2021 Lender1 6/20/2025 14643881
3 8/31/2021 Lender1 6/20/2025 44935677
4 7/31/2021 Lender1 6/30/2022 150000
5 12/31/2020 Lender2 5/15/2024 2750000