I have a dataframe with variables, 2 of which are the start and end year of the data collection timeframe.
i.e.
Var1 Var2 start end
1 A 2000 2022
2 B 1990 1995
3 C 2005 2006
... ... ... ...
I would like to modify this dataframe so that a column for every year is displayed and the presence/absence of data for each year is coded as a binary variable (1/0).
Because the dataset has thousands of rows and many varied year ranges I would like this process to be automated and not manual.
i.e.
Var1 Var2 1990 1991 1992 1993 1994 1995 1996 ... 2021 2022
1 A 0 0 0 0 0 0 0 1 1
2 B 1 1 1 1 1 1 0 0 0
3 C 0 0 0 0 0 0 0 0 0
Then I would like to transform this dataframe so that the years are displayed into 1 row
i.e.
Year Var1 Var2 Data_Availability
1990 1 A 0
1990 2 B 1
1990 3 C 0
...
2022 1 A 1
2022 2 B 0
2022 3 C 0
Any help on how to achieve this is greatly appreciated.
CodePudding user response:
This looks a bit cumbersome but it is an option. The methodology is to convert to long so start
and end
are one column. Then group by the Var2
and complete the sequence of year. Spread to wide to get the dummy variables then convert again to long to get the structure you want, i.e.
library(dplyr)
library(tidyr)
dd %>%
pivot_longer(-c(1:2)) %>%
group_by(Var2) %>%
complete(value = seq(min(value), max(value))) %>%
select(1, 2) %>%
pivot_wider(names_from = value) %>%
mutate_at(-1, ~ (ifelse(is.na(.), 0, 1))) %>%
pivot_longer(-1)
# A tibble: 87 x 3
# Groups: Var2 [3]
# Var2 name value
# <chr> <chr> <dbl>
# 1 A 2000 1
# 2 A 2001 1
# 3 A 2002 1
# 4 A 2003 1
# 5 A 2004 1
# 6 A 2005 1
# 7 A 2006 1
# 8 A 2007 1
# 9 A 2008 1
#10 A 2009 1
CodePudding user response:
My approach is to pivot longer, expanding with all years between 1990 and 2022, and identifying which rows are between start and end.
library(dplyr)
library(tidyr)
dat_long <- dat |>
pivot_longer(start:end, values_to = "Year")
dat_long |>
full_join(data.frame(Var1 = 1, Var2 = "A", Year = 1990:2022)) |>
expand(Var1, Var2, Year) |>
left_join(dat_long) |>
group_by(Var1, Var2) |>
fill(name) |>
mutate(Data_Availability = case_when(name == "start" | name == "end" & lag(name == "start") ~ 1,
TRUE ~ 0)) |>
select(Year, Var1, Var2, Data_Availability)
CodePudding user response:
Base R solution:
# Calculate the range of years in start end dates:
# year_rng => integer vector
year_rng <- Reduce(
seq,
range(
unlist(
df[,c("start", "end")]
)
)
)
# Compute whether or not start ends fall within range,
# and column bind other variables: data.frame => stdout(console)
cbind(
df[,c("Var1", "Var2")],
setNames(
data.frame(
t(
vapply(
seq_along(df$start),
FUN = function(i){
(df$start[i] <= year_rng & df$end[i] >= year_rng)
},
integer(length(year_rng))
)
)
),
year_rng
)
)