I am trying to fill the NA values in this data frame with the most recent non-NA value in the cost column. I want to group by city - so all NAs for Omaha should be 44.50, and the NAs for Lincoln should be 62.50. Here is the code I have been using - it replaces the first NA (April) for each group with the correct value, but does not fill past that.
df <- df %>%
group_by(city) %>%
mutate(cost = ifelse(is.na(cost), lag(cost, na.rm=TRUE), cost))
Data before running code:
year month city cost
2021 January Omaha 45.50
2021 February Omaha 46.75
2021 March Omaha 44.50
2021 April Omaha NA
2021 May Omaha NA
2021 June Omaha NA
2021 January Lincoln 55.25
2021 February Lincoln 53.80
2021 March Lincoln 62.50
2021 April Lincoln NA
2021 May Lincoln NA
2021 June Lincoln NA
CodePudding user response:
Use:
library(tidyverse)
df %>%
group_by(city) %>%
fill(cost)
# A tibble: 12 x 4
# Groups: city [2]
year month city cost
<int> <chr> <chr> <dbl>
1 2021 January Omaha 45.5
2 2021 February Omaha 46.8
3 2021 March Omaha 44.5
4 2021 April Omaha 44.5
5 2021 May Omaha 44.5
6 2021 June Omaha 44.5
7 2021 January Lincoln 55.2
8 2021 February Lincoln 53.8
9 2021 March Lincoln 62.5
10 2021 April Lincoln 62.5
11 2021 May Lincoln 62.5
12 2021 June Lincoln 62.5
CodePudding user response:
With your code, you would want to use last
rather than lag
(though fill
is the much better option here). We also need to wrap cost
in na.omit
.
library(tidyverse)
df %>%
group_by(city) %>%
mutate(cost = ifelse(is.na(cost), last(na.omit(cost)), cost))
Output
year month city cost
<int> <chr> <chr> <dbl>
1 2021 January Omaha 45.5
2 2021 February Omaha 46.8
3 2021 March Omaha 44.5
4 2021 April Omaha 44.5
5 2021 May Omaha 44.5
6 2021 June Omaha 44.5
7 2021 January Lincoln 55.2
8 2021 February Lincoln 53.8
9 2021 March Lincoln 62.5
10 2021 April Lincoln 62.5
11 2021 May Lincoln 62.5
12 2021 June Lincoln 62.5
Data
df <- structure(list(year = c(2021L, 2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L, 2021L, 2021L, 2021L), month = c("January",
"February", "March", "April", "May", "June", "January", "February",
"March", "April", "May", "June"), city = c("Omaha", "Omaha",
"Omaha", "Omaha", "Omaha", "Omaha", "Lincoln", "Lincoln", "Lincoln",
"Lincoln", "Lincoln", "Lincoln"), cost = c(45.5, 46.75, 44.5,
NA, NA, NA, 55.25, 53.8, 62.5, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-12L))