I'd like to detect if there are non-consecutive years in my Dataframe. For instance:
Ticker Year Gap
A 2017-01-01
A 2018-01-01 OK
A 2019-01-01 OK
A 2021-01-01 NOT OK
A 2022-01-01 OK
AAPL 2019-01-01
AAPL 2020-01-01 OK
AAPL 2021-01-01 OK
AAPL 2022-01-01 OK
AAPL 2023-01-01 OK
What I've done is the following:
data %>%
dplyr::select(Year, Ticker) %>%
group_by(Ticker) %>%
mutate(diff = Year - lag(Year))
diff is a column returning the number of days, hence if the number of days is higher than 366 I just return NOT OK.
My issue is that "diff" is also computing the number of days between the first Year from AAPL and the last year from A while I thought the "group by" would only allow the computation of year difference within groups. For instance, on the row of AAPL 2019-01-01, we would have a negative number of days since the row above, A, is dated on the 2022.
Any helps would be appreciated, thanks.
CodePudding user response:
Using diff
with ave
.
transform(dat, flag=ave(as.numeric(Year), Ticker, FUN=\(x) c(365, diff(x)) > 366))
# Ticker Year flag
# 1 A 2017-01-01 0
# 2 A 2018-01-01 0
# 3 A 2019-01-01 0
# 4 A 2021-01-01 1
# 5 A 2022-01-01 0
# 6 AAPL 2019-01-01 0
# 7 AAPL 2020-01-01 0
# 8 AAPL 2021-01-01 0
# 9 AAPL 2022-01-01 0
# 10 AAPL 2023-01-01 0
Data:
dat <- structure(list(Ticker = c("A", "A", "A", "A", "A", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL"), Year = structure(c(17167, 17532, 17897,
18628, 18993, 17897, 18262, 18628, 18993, 19358), class = "Date")), row.names = c(NA,
-10L), class = "data.frame")
CodePudding user response:
With dplyr
library(tidyverse)
df %>%
group_by(Ticker) %>%
mutate(gap = case_when(year(lag(Year)) == year(Year) - 1 | Year == first(Year) ~ "OK",
TRUE ~ "NOT OK"))
# A tibble: 10 x 3
# Groups: Ticker [2]
Ticker Year gap
<chr> <date> <chr>
1 A 2017-01-01 OK
2 A 2018-01-01 OK
3 A 2019-01-01 OK
4 A 2021-01-01 NOT OK
5 A 2022-01-01 OK
6 AAPL 2019-01-01 OK
7 AAPL 2020-01-01 OK
8 AAPL 2021-01-01 OK
9 AAPL 2022-01-01 OK
10 AAPL 2023-01-01 OK