Create a column per year defined within a timeframe range-CodePudding

I have a dataframe with variables, 2 of which are the start and end year of the data collection timeframe.

i.e.

Var1  Var2   start   end
 1     A     2000   2022
 2     B     1990   1995
 3     C     2005   2006
...   ...     ...    ...

I would like to modify this dataframe so that a column for every year is displayed and the presence/absence of data for each year is coded as a binary variable (1/0).

Because the dataset has thousands of rows and many varied year ranges I would like this process to be automated and not manual.

i.e.

Var1  Var2   1990   1991  1992 1993 1994 1995 1996 ... 2021  2022
 1     A      0      0     0    0     0   0     0       1      1
 2     B      1      1     1    1     1   1     0       0      0     
 3     C      0      0     0    0     0   0     0       0      0

Then I would like to transform this dataframe so that the years are displayed into 1 row

i.e.

Year  Var1  Var2  Data_Availability
1990  1      A           0
1990  2      B           1
1990  3      C           0
...
2022  1      A           1
2022  2      B           0
2022  3      C           0

Any help on how to achieve this is greatly appreciated.

CodePudding user response：

This looks a bit cumbersome but it is an option. The methodology is to convert to long so start and end are one column. Then group by the Var2 and complete the sequence of year. Spread to wide to get the dummy variables then convert again to long to get the structure you want, i.e.

library(dplyr)
library(tidyr)

dd %>% 
 pivot_longer(-c(1:2)) %>% 
 group_by(Var2) %>% 
 complete(value = seq(min(value), max(value))) %>% 
 select(1, 2) %>% 
 pivot_wider(names_from = value) %>% 
 mutate_at(-1, ~ (ifelse(is.na(.), 0, 1))) %>% 
 pivot_longer(-1)

# A tibble: 87 x 3
# Groups:   Var2 [3]
#   Var2  name  value
#   <chr> <chr> <dbl>
# 1 A     2000      1
# 2 A     2001      1
# 3 A     2002      1
# 4 A     2003      1
# 5 A     2004      1
# 6 A     2005      1
# 7 A     2006      1
# 8 A     2007      1
# 9 A     2008      1
#10 A     2009      1

CodePudding user response：

My approach is to pivot longer, expanding with all years between 1990 and 2022, and identifying which rows are between start and end.

library(dplyr)
library(tidyr)
dat_long <- dat |> 
        pivot_longer(start:end, values_to = "Year")

dat_long |> 
        full_join(data.frame(Var1 = 1, Var2 = "A", Year = 1990:2022)) |>
        expand(Var1, Var2, Year) |>
        left_join(dat_long) |> 
        group_by(Var1, Var2) |> 
        fill(name) |> 
        mutate(Data_Availability = case_when(name == "start" | name == "end" & lag(name == "start") ~ 1,
                                             TRUE ~ 0)) |> 
        select(Year, Var1, Var2, Data_Availability)

CodePudding user response：

Base R solution:

# Calculate the range of years in start end dates: 
# year_rng => integer vector
year_rng <- Reduce(
  seq, 
  range(
    unlist(
      df[,c("start", "end")]
    )
  )
)

# Compute whether or not start ends fall within range, 
# and column bind other variables: data.frame => stdout(console)
cbind(
    df[,c("Var1", "Var2")],
    setNames(
      data.frame(
        t(
        vapply(
          seq_along(df$start),
          FUN = function(i){
             (df$start[i] <= year_rng & df$end[i] >= year_rng)
          },
          integer(length(year_rng))
        )
      )
    ),
    year_rng
  )
)