Home > front end >  Adding rows to fill "missing" years that are based on the values of the groups in the prev
Adding rows to fill "missing" years that are based on the values of the groups in the prev

Time:03-15

I have a dataframe in the following format:

df <- data.frame(year = c(2000, 2000, 2000, 2000, 2000, 2004, 2004, 2004, 2004, 2004,
                             2010, 2010, 2010, 2010, 2010),
                    city = c("City A", "City B", "City C", "City D", "City E", 
                             "City A", "City B", "City C", "City D", "City E",
                             "City A", "City B", "City C", "City D", "City E"),
                    constant_y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
df

   year   city constant_y
1  2000 City A          1
2  2000 City B          2
3  2000 City C          3
4  2000 City D          4
5  2000 City E          5
6  2004 City A          6
7  2004 City B          7
8  2004 City C          8
9  2004 City D          9
10 2004 City E         10
11 2010 City A         11
12 2010 City B         12
13 2010 City C         13
14 2010 City D         14
15 2010 City E         15

I'd like to fill in/add the missing years for each city, using data from the prior year for that city. So in a way duplicate rows while changing the year column value, grouping by city. Below is the output that I am trying to get at for each city (City A as example)

   year   city constant_y
1  2000 City A          1
2  2001 City A          1
3  2002 City A          1
4  2003 City A          1
5  2004 City A          6
6  2005 City A          6
7  2006 City A          6
8  2007 City A          6
9  2008 City A          6
10 2009 City A          6  
11 2010 City A         11
12 2011 City A         11
13 2012 City A         11
14 2013 City A         11
15 2014 City A         11  
16 2015 City A         11
17 2016 City A         11
18 2017 City A         11
19 2018 City A         11  
20 2019 City A         11      

And then the same for City B, C, D etc. (using their "constant_y" values in prior years). E.g. City B would have 2 until 2003, then 7 from 2004 to 2009, and then 12 from 2010 to 2019.

So yes, I just want to add rows that duplicate/uses each city's "constant_y" to the following year. My data stops at some year (2010), but I want to use the values from to 2010 to extend it some years further, e.g. 2019 in the example above. I hope I am not overcomplicating this, but I am not sure how to solve it

CodePudding user response:

Here's one method that starts by finding all possible city/year combinations, joining this on the original data, and then filling (via last-observation-carry-forward techniques) constant_y per-city.

dplyr

library(dplyr)
library(tidyr) # expand, fill
df %>%
  expand(city, year = do.call(seq, as.list(range(year)))) %>%
  full_join(df, by = c("city", "year")) %>%
  arrange(city, year) %>%
  fill(constant_y)
# # A tibble: 55 x 3
#    city    year constant_y
#    <chr>  <dbl>      <dbl>
#  1 City A  2000          1
#  2 City A  2001          1
#  3 City A  2002          1
#  4 City A  2003          1
#  5 City A  2004          6
#  6 City A  2005          6
#  7 City A  2006          6
#  8 City A  2007          6
#  9 City A  2008          6
# 10 City A  2009          6
# # ... with 45 more rows

Granted, this only goes out to 2010, since that's all that was in your original data. If you need it to go beyond the original data, then change to

df %>%
  expand(city, year = do.call(seq, as.list(range(c(year, 2019))))) %>%
  ...                                          # ^^^^^^^^^^^^^ different

base R

# library(zoo) # na.locf
df2 <- merge(
  df,
  expand.grid(city = unique(df$city), year = do.call(seq, as.list(range(df$year)))),
  by = c("city", "year"), all = TRUE)
df2$constant_y <- ave(df2$constant_y, df2$city, FUN = zoo::na.locf, na.rm = FALSE)
subset(df2, city == "City A")
#      city year constant_y
# 1  City A 2000          1
# 2  City A 2001          1
# 3  City A 2002          1
# 4  City A 2003          1
# 5  City A 2004          6
# 6  City A 2005          6
# 7  City A 2006          6
# 8  City A 2007          6
# 9  City A 2008          6
# 10 City A 2009          6
# 11 City A 2010         11

(Same with with 2010 vs 2019.)

data.table

library(data.table)
DT <- as.data.table(df) # canonical would be `setDT(df)` instead
DT <- DT[, CJ(city = unique(city), year = do.call(seq, as.list(range(year))))
  ][DT, constant_y := i.constant_y, on = .(city, year)
  ][, constant_y := nafill(constant_y, type = "locf"), by = .(city)]
DT
#       city  year constant_y
#     <char> <int>      <num>
#  1: City A  2000          1
#  2: City A  2001          1
#  3: City A  2002          1
#  4: City A  2003          1
#  5: City A  2004          6
#  6: City A  2005          6
#  7: City A  2006          6
#  8: City A  2007          6
#  9: City A  2008          6
# 10: City A  2009          6
# ---                        
# 46: City E  2001          5
# 47: City E  2002          5
# 48: City E  2003          5
# 49: City E  2004         10
# 50: City E  2005         10
# 51: City E  2006         10
# 52: City E  2007         10
# 53: City E  2008         10
# 54: City E  2009         10
# 55: City E  2010         15
  •  Tags:  
  • r
  • Related