Home > Software design >  Gap in years in a column
Gap in years in a column

Time:09-19

I'd like to detect if there are non-consecutive years in my Dataframe. For instance:

Ticker Year       Gap
A      2017-01-01
A      2018-01-01 OK
A      2019-01-01 OK
A      2021-01-01 NOT OK
A      2022-01-01 OK
AAPL   2019-01-01 
AAPL   2020-01-01 OK
AAPL   2021-01-01 OK
AAPL   2022-01-01 OK
AAPL   2023-01-01 OK

What I've done is the following:

data %>%
  dplyr::select(Year, Ticker) %>%
  group_by(Ticker) %>%
  mutate(diff = Year - lag(Year))

diff is a column returning the number of days, hence if the number of days is higher than 366 I just return NOT OK.

My issue is that "diff" is also computing the number of days between the first Year from AAPL and the last year from A while I thought the "group by" would only allow the computation of year difference within groups. For instance, on the row of AAPL 2019-01-01, we would have a negative number of days since the row above, A, is dated on the 2022.

Any helps would be appreciated, thanks.

CodePudding user response:

Using diff with ave.

transform(dat, flag=ave(as.numeric(Year), Ticker, FUN=\(x) c(365, diff(x)) > 366))
#    Ticker       Year flag
# 1       A 2017-01-01    0
# 2       A 2018-01-01    0
# 3       A 2019-01-01    0
# 4       A 2021-01-01    1
# 5       A 2022-01-01    0
# 6    AAPL 2019-01-01    0
# 7    AAPL 2020-01-01    0
# 8    AAPL 2021-01-01    0
# 9    AAPL 2022-01-01    0
# 10   AAPL 2023-01-01    0

Data:

dat <- structure(list(Ticker = c("A", "A", "A", "A", "A", "AAPL", "AAPL", 
"AAPL", "AAPL", "AAPL"), Year = structure(c(17167, 17532, 17897, 
18628, 18993, 17897, 18262, 18628, 18993, 19358), class = "Date")), row.names = c(NA, 
-10L), class = "data.frame")

CodePudding user response:

With dplyr

library(tidyverse) 

df %>%
  group_by(Ticker) %>%  
  mutate(gap = case_when(year(lag(Year)) == year(Year) - 1 | Year == first(Year)  ~ "OK",
                         TRUE ~ "NOT OK"))

# A tibble: 10 x 3
# Groups:   Ticker [2]
   Ticker Year       gap   
   <chr>  <date>     <chr> 
 1 A      2017-01-01 OK    
 2 A      2018-01-01 OK    
 3 A      2019-01-01 OK    
 4 A      2021-01-01 NOT OK
 5 A      2022-01-01 OK    
 6 AAPL   2019-01-01 OK    
 7 AAPL   2020-01-01 OK    
 8 AAPL   2021-01-01 OK    
 9 AAPL   2022-01-01 OK    
10 AAPL   2023-01-01 OK    
  • Related