Backfilling Rows Based on Max Conditions in R-CodePudding

I have a dataset that looks like this in R:

name = c("john", "john", "john", "alex", "alex", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2015, 2014, 2016, 2010, 2011, 2012, 2013)
age = c(21, 21, 21, 55, 55, 61, 61, 61, 61)

problem_data = data.frame(name, year, age)

   name year age
1  john 2010  21
2  john 2011  21
3  john 2015  21
4  alex 2014  55
5  alex 2016  55
6 peter 2010  61
7 peter 2011  61
8 peter 2012  61
9 peter 2013  61

In this dataset, the age of each person at the last recorded year has been erroneously inserted at each row. For example - in reality:

Peter was 61 in 2013
Peter was 60 in 2012
Peter was 59 in 2011
Peter was 58 in 2010

Sometimes years are missing - as a result:

Alex was 55 in 2016
Alex was 53 in 2014

I am trying to research a way in R that can handle such a task. I have been trying to combine "cumulative group differences" and "max row conditions" - but I am not sure how these concepts can be combined together to achieve this:

# https://stackoverflow.com/questions/39237345/subtract-value-in-previous-row-for-each-section-of-a-data-frame-that-corresponds

library(dplyr)

new_data = problem_data %>% 
    group_by(name) %>% 
    mutate(real_age= age - lag(age, default = age[1]))

Bur this approach has made everyone's age as 0!

# A tibble: 9 x 4
# Groups:   name [3]
  name   year   age real_age
  <chr> <dbl> <dbl>    <dbl>
1 john   2010    21        0
2 john   2011    21        0
3 john   2015    21        0
4 alex   2014    55        0
5 alex   2016    55        0
6 peter  2010    61        0
7 peter  2011    61        0
8 peter  2012    61        0
9 peter  2013    61        0

Can someone please show me how to fix this problem?

Thank you!

CodePudding user response：

Group by 'name', use complete to get the missing 'year', fill the NA elements in 'year', 'age' with previous non-NA values, subtract the 'age' from the sequence of group index

library(dplyr)
library(tidyr)
problem_data %>% 
    group_by(name) %>% 
    complete(year = full_seq(year, period = 1)) %>% 
    fill(year, age, .direction = "downup") %>%
    mutate(real_age= age - (row_number() - 1)) %>%
   ungroup

CodePudding user response：

Do you need this:

library(dplyr)
library(tidyr)
problem_data %>% 
  group_by(name) %>% 
  complete(., year = first(year):last(year), fill = list(freq = 0)) %>%
  arrange(-year, .by_group = TRUE) %>% 
  mutate(real_age = ifelse(row_number()>1,  last(age)-row_number() 1,  last(age))) %>% 
  drop_na() # optional

  name   year   age real_age
  <chr> <dbl> <dbl>    <dbl>
1 alex   2016    55       55
2 alex   2014    55       53
3 john   2015    21       21
4 john   2011    21       17
5 john   2010    21       16
6 peter  2013    61       61
7 peter  2012    61       60
8 peter  2011    61       59
9 peter  2010    61       58