Home > Software engineering >  Expand and then fill a dataframe
Expand and then fill a dataframe

Time:04-25

I have patchy data on property rental prices over the 8 years 2014 to 2021. I want to expand to a full panel so that for each year every property has a value for each of the years. This is done by carrying over the value from one year to another. At the moment each property is an entry in a list of data frames. This R code creates some dummy data (df1, df2, df3) and gives the result (df4) that I am looking for.

year <- c(2014, 2019, 2021)
price <- c(100, 110, 120)
df1 <- data.frame(cbind(id=1, year, price))

year <- c(2016, 2019, 2021)
price <- c(200, 210, 220)
df2 <- data.frame(cbind(id=2, year, price))

year <-c (2014, 2015, 2019, 2020)
price <-c (300, 310, 320, 330)
df3 <- data.frame(cbind(id=3, year, price))

list1 <- list(df1, df2, df3)

id <- c(rep(1,8), rep(2,8), rep(3,8))
year <- c(rep(seq(2014,2021), 3))
price <- c(100, 100, 100, 100, 100, 110, 110, 120,
           200, 200, 200, 200, 200, 210, 210, 220,
           300, 310, 310, 310, 310, 320, 330, 330)
df4 <- data.frame(id, year, price)

So for property id=1 I have data for 2014, 2019 and 2021. I will forward fill in 2015 to 2018 with the 2014 price. For property id=2 I will back fill in 2014 and 2015 with 2016, and also forward fill from 2017 to 2018 with 2016.

I can start this by rbinding all the data frames in the list and then merging this with a new data frame that has all the id by year combinations. But this will not do the infilling that I want, all the missing id by year combinations will be NA.

In total I have about 100k properties to do this task for. Thanks.

CodePudding user response:

Or with purrr

years <- tibble(year = c(2014:2021))

list1 %>% 
  map_dfr(~.x %>% 
        right_join(years, by = "year") %>% 
        arrange(year) %>% 
        fill(id, .direction = "down") %>% 
        fill(id, .direction = "up") %>% 
        fill(price, .direction = "downup")) 

CodePudding user response:

Check out the fill() function function through tidyverse.

Using your example, but inducing the NA's you mention, df5 should be what you're looking for here.

library( tidyverse )
year <- c(2014, 2019, 2021)
price <- c(100, 110, 120)
df1 <- data.frame(cbind(id=1, year, price))

year <- c(2016, 2019, 2021)
price <- c(200, 210, 220)
df2 <- data.frame(cbind(id=2, year, price))

year <-c (2014, 2015, 2019, 2020)
price <-c (300, 310, 320, 330)
df3 <- data.frame(cbind(id=3, year, price))

list1 <- list(df1, df2, df3)

id <- c(rep(1,8), rep(2,8), rep(3,8))
year <- c(rep(seq(2014,2021), 3))
price <- c(100, NA, NA, NA, NA, 110, NA, 120,
           NA, NA, 200, NA, NA, 210, 210, 220,
           300, 310, 310, 310, 310, 320, 330, 330)
df4 <- data.frame(id, year, price)
df5 <- df4 %>% group_by( id ) %>% fill( price, .direction = "downup" )
  • Related