Home > Back-end >  Reducing amount of rows in data set with multiple conditions based on dates
Reducing amount of rows in data set with multiple conditions based on dates

Time:07-25

I have a dataset with several financial ratios for several companies over the years. But I have the problem that the period for the ratios is different. A part of the dataset looks like this (dates are designed like yyyymmdd):

CopanyName     StartDate     EndDate      ROA      ROE     
AAR CORP       20050301      20050531     0.086    0.057
AAR CORP       20050301      20050831     0.092    0.065
AAR CORP       20050301      20051130     0.099    0.073
AAR CORP       20050301      20060228     0.108    0.083
AAR CORP       20060301      20060531     0.107    0.090
AAR CORP       20060301      20060831     0.117    0.110
AAR CORP       20060301      20061130     0.116    0.116
AAR CORP       20060301      20070228     0.115    0.112

ARP            20050601      20050831     0.076    0.053
ARP            20050601      20051130     0.082    0.068
ARP            20050601      20060228     0.089    0.075
ARP            20050601      20060531     0.098    0.081
ARP            20060601      20060831     0.077    0.068
ARP            20060601      20061130     0.079    0.067
ARP            20060601      20070228     0.086    0.078
ARP            20060601      20070531     0.095    0.083

Now I want to reduce the amount of rows by deleting al the quarterly updates. So, that for this simplified dataset it results in this:

CopanyName     StartDate     EndDate      ROA      ROE  
AAR CORP       20050301      20060228     0.108    0.083
AAR CORP       20060301      20070228     0.115    0.112
ARP            20050601      20060531     0.098    0.081
ARP            20060601      20070531     0.095    0.083

For an easy dataset like this I could just manually remove the other rows, but I have a datset with data for over 4000 companies and more than 150000 observations. Hope anyone can help me out.

As a follow up, I want to change the date colums. I want to create one column with a year. So it will look like this:

CopanyName     Year     ROA      ROE  
AAR CORP       2005     0.108    0.083
AAR CORP       2006     0.115    0.112
ARP            2005     0.098    0.081
ARP            2006     0.095    0.083

Hope anyone can help

CodePudding user response:

You could use lubridate's ymd to convert your dates into date objects and dplyr to do the filtering based on these, keeping only rows spanning a full year (here 360 days to be safe):

library(lubridate)
library(dplyr)

dat |>
  mutate(across(ends_with("Date"), ymd),
         Year = year(StartDate)) |>
  filter(EndDate - StartDate > 360) |>
  select(-ends_with("Date"))

In the simple case where it's every fourth row:

library(dplyr)

dat |>
  filter(row_number() %% 4 == 0) |>
  mutate(Year = substr(StartDate, start = 1, stop = 4)) |>
  select(-ends_with("Date"))

Output:

# A tibble: 4 × 4
  CopanyName   ROA   ROE  Year
  <chr>      <dbl> <dbl> <dbl>
1 AAR CORP   0.108 0.083  2005
2 AAR CORP   0.115 0.112  2006
3 ARP        0.098 0.081  2005
4 ARP        0.095 0.083  2006

Data:

dat <-
  read_csv("CopanyName,     StartDate,     EndDate,      ROA,      ROE     
AAR CORP,       20050301,      20050531,     0.086,    0.057
AAR CORP,       20050301,      20050831,     0.092,    0.065
AAR CORP,       20050301,      20051130,     0.099,    0.073
AAR CORP,       20050301,      20060228,     0.108,    0.083
AAR CORP,       20060301,      20060531,     0.107,    0.090
AAR CORP,       20060301,      20060831,     0.117,    0.110
AAR CORP,       20060301,      20061130,     0.116,    0.116
AAR CORP,       20060301,      20070228,     0.115,    0.112
ARP,            20050601,      20050831,     0.076,    0.053
ARP,            20050601,      20051130,     0.082,    0.068
ARP,            20050601,      20060228,     0.089,    0.075
ARP,            20050601,      20060531,     0.098,    0.081
ARP,            20060601,      20060831,     0.077,    0.068
ARP,            20060601,      20061130,     0.079,    0.067
ARP,            20060601,      20070228,     0.086,    0.078
ARP,            20060601,      20070531,     0.095,    0.083")
  •  Tags:  
  • r
  • Related