I created a simplified version of my data since my actual data is too large to be include here.
structure(list(Name = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), Category = c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), Year = c(2018L, 2019L, 2020L, 2021L, 2018L,
2019L, 2020L, 2021L, 2018L, 2019L, 2020L, 2021L), Cost = c(NA,
100L, NA, NA, 99L, 45L, 34L, NA, 64L, 34L, NA, 34L)), class = "data.frame", row.names = c(NA,
-12L))
The focus here is on the column "Cost" . If there is NA, in a first step, I use the most recent year with available data in the respective category.
Data %>%
group_by(`Category`, Year) %>%
dplyr:: mutate(Cost = na.locf(Cost))
The problem now is that for "Name" A, there is no value in the "Year" 2018. That's why I get an error. How do I need to adjust, the code, that then in a second step, if there is no data available from prior year for this respective "Category", I use the data from the next available year. In this specific case it would be "Year" 2019 and the value 100.
Thank you for your help in advance.
CodePudding user response:
Use tidyr::fill
with .direction = "downup"
:
library(tidyr)
library(dplyr)
Data %>%
group_by(Category) %>%
fill(Cost, .direction = "downup")
# A tibble: 12 × 4
# Groups: Category [2]
Name Category Year Cost
<chr> <int> <int> <int>
1 A 1 2018 100
2 A 1 2019 100
3 A 1 2020 100
4 A 1 2021 100
5 B 2 2018 99
6 B 2 2019 45
7 B 2 2020 34
8 B 2 2021 34
9 C 2 2018 64
10 C 2 2019 34
11 C 2 2020 34
12 C 2 2021 34