I have a dataset with several financial ratios for several companies over the years. But I have the problem that the period for the ratios is different. A part of the dataset looks like this (dates are designed like yyyymmdd):
CopanyName StartDate EndDate ROA ROE
AAR CORP 20050301 20050531 0.086 0.057
AAR CORP 20050301 20050831 0.092 0.065
AAR CORP 20050301 20051130 0.099 0.073
AAR CORP 20050301 20060228 0.108 0.083
AAR CORP 20060301 20060531 0.107 0.090
AAR CORP 20060301 20060831 0.117 0.110
AAR CORP 20060301 20061130 0.116 0.116
AAR CORP 20060301 20070228 0.115 0.112
ARP 20050601 20050831 0.076 0.053
ARP 20050601 20051130 0.082 0.068
ARP 20050601 20060228 0.089 0.075
ARP 20050601 20060531 0.098 0.081
ARP 20060601 20060831 0.077 0.068
ARP 20060601 20061130 0.079 0.067
ARP 20060601 20070228 0.086 0.078
ARP 20060601 20070531 0.095 0.083
Now I want to reduce the amount of rows by deleting al the quarterly updates. So, that for this simplified dataset it results in this:
CopanyName StartDate EndDate ROA ROE
AAR CORP 20050301 20060228 0.108 0.083
AAR CORP 20060301 20070228 0.115 0.112
ARP 20050601 20060531 0.098 0.081
ARP 20060601 20070531 0.095 0.083
For an easy dataset like this I could just manually remove the other rows, but I have a datset with data for over 4000 companies and more than 150000 observations. Hope anyone can help me out.
As a follow up, I want to change the date colums. I want to create one column with a year. So it will look like this:
CopanyName Year ROA ROE
AAR CORP 2005 0.108 0.083
AAR CORP 2006 0.115 0.112
ARP 2005 0.098 0.081
ARP 2006 0.095 0.083
Hope anyone can help
CodePudding user response:
You could use lubridate
's ymd
to convert your dates into date objects and dplyr
to do the filtering based on these, keeping only rows spanning a full year (here 360 days to be safe):
library(lubridate)
library(dplyr)
dat |>
mutate(across(ends_with("Date"), ymd),
Year = year(StartDate)) |>
filter(EndDate - StartDate > 360) |>
select(-ends_with("Date"))
In the simple case where it's every fourth row:
library(dplyr)
dat |>
filter(row_number() %% 4 == 0) |>
mutate(Year = substr(StartDate, start = 1, stop = 4)) |>
select(-ends_with("Date"))
Output:
# A tibble: 4 × 4
CopanyName ROA ROE Year
<chr> <dbl> <dbl> <dbl>
1 AAR CORP 0.108 0.083 2005
2 AAR CORP 0.115 0.112 2006
3 ARP 0.098 0.081 2005
4 ARP 0.095 0.083 2006
Data:
dat <-
read_csv("CopanyName, StartDate, EndDate, ROA, ROE
AAR CORP, 20050301, 20050531, 0.086, 0.057
AAR CORP, 20050301, 20050831, 0.092, 0.065
AAR CORP, 20050301, 20051130, 0.099, 0.073
AAR CORP, 20050301, 20060228, 0.108, 0.083
AAR CORP, 20060301, 20060531, 0.107, 0.090
AAR CORP, 20060301, 20060831, 0.117, 0.110
AAR CORP, 20060301, 20061130, 0.116, 0.116
AAR CORP, 20060301, 20070228, 0.115, 0.112
ARP, 20050601, 20050831, 0.076, 0.053
ARP, 20050601, 20051130, 0.082, 0.068
ARP, 20050601, 20060228, 0.089, 0.075
ARP, 20050601, 20060531, 0.098, 0.081
ARP, 20060601, 20060831, 0.077, 0.068
ARP, 20060601, 20061130, 0.079, 0.067
ARP, 20060601, 20070228, 0.086, 0.078
ARP, 20060601, 20070531, 0.095, 0.083")