Home > Mobile >  R Dataframe filtering: Using unique or duplicate function based on time factor
R Dataframe filtering: Using unique or duplicate function based on time factor

Time:09-25

I'm trying to filter a dataframe of loan data, but each monthly report duplicates loans if they're still outstanding, or drops loans if paid (can't just use latest monthly report). I'd like to filter through lenders for unique Date of Maturity for loans and remove duplicates and keeping only the most recent data by report date. Here's an example of the data:

df <- data.frame(Reporting.date=c("6/30/2020","6/30/2020","6/30/2020","8/31/2021","8/31/2021"
                                  ,"8/31/2021","6/30/2020","7/31/2021","5/31/2020","12/31/2020")
                 , Lender.name=c("Lender1","Lender1","Lender1","Lender1","Lender1","Lender1"
                                 ,"Lender1","Lender1","Lender2","Lender2")
                 , Date.of.maturity=c("6/20/2025","6/20/2025","6/20/2025","6/20/2025","6/20/2025"
                                      ,"6/20/2025","6/30/2022","6/30/2022","5/15/2024","5/15/2024")
                 , Loan.amount=c(13129474,14643881,44935677,13129474,14643881,44935677
                                 ,150000,150000,2750000,2750000))

As you can see from the example data Lender1 has 2 unique maturity dates. The first date has 3 loans that get duplicated across 2 reporting dates, and the second maturity date has 1 loan being duplicated. I want to remove duplicates to keep the newest report data. My hope is to get a dataframe that looks like this afterwards:

Reporting.date Lender.name Date.of.maturity Loan.amount
8/31/2021 Lender1 6/20/2025 13129474
8/31/2021 Lender1 6/20/2025 14643881
8/31/2021 Lender1 6/20/2025 44935677
7/31/2021 Lender1 6/30/2022 150000
12/31/2020 Lender2 5/15/2024 2750000

CodePudding user response:

You'll need to convert Reporting.date to Date format, either in a mutate (like I did) or directly in the filter.

library(tidyverse)

df %>%
  mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  filter(Reporting.date == max(Reporting.date)) %>%
  ungroup()

CodePudding user response:

We may also do this with arrange

library(dplyr)
library(lubridate)
df %>%
  arrange(Lender.name, Date.of.maturity, Loan.amount, 
         desc(mdy(Reporting.date))) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  slice_head(n = 1) %>%
  ungroup

-output

# A tibble: 5 x 4
  Reporting.date Lender.name Date.of.maturity Loan.amount
  <chr>          <chr>       <chr>                  <dbl>
1 8/31/2021      Lender1     6/20/2025           13129474
2 8/31/2021      Lender1     6/20/2025           14643881
3 8/31/2021      Lender1     6/20/2025           44935677
4 7/31/2021      Lender1     6/30/2022             150000
5 12/31/2020     Lender2     5/15/2024            2750000
  • Related