Home > Software engineering >  Backfilling Rows Based on Max Conditions in R
Backfilling Rows Based on Max Conditions in R

Time:11-09

I have a dataset that looks like this in R:

name = c("john", "john", "john", "alex", "alex", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2015, 2014, 2016, 2010, 2011, 2012, 2013)
age = c(21, 21, 21, 55, 55, 61, 61, 61, 61)

problem_data = data.frame(name, year, age)

   name year age
1  john 2010  21
2  john 2011  21
3  john 2015  21
4  alex 2014  55
5  alex 2016  55
6 peter 2010  61
7 peter 2011  61
8 peter 2012  61
9 peter 2013  61

In this dataset, the age of each person at the last recorded year has been erroneously inserted at each row. For example - in reality:

  • Peter was 61 in 2013
  • Peter was 60 in 2012
  • Peter was 59 in 2011
  • Peter was 58 in 2010

Sometimes years are missing - as a result:

  • Alex was 55 in 2016
  • Alex was 53 in 2014

I am trying to research a way in R that can handle such a task. I have been trying to combine "cumulative group differences" and "max row conditions" - but I am not sure how these concepts can be combined together to achieve this:

# https://stackoverflow.com/questions/39237345/subtract-value-in-previous-row-for-each-section-of-a-data-frame-that-corresponds

library(dplyr)

new_data = problem_data %>% 
    group_by(name) %>% 
    mutate(real_age= age - lag(age, default = age[1]))

Bur this approach has made everyone's age as 0!

# A tibble: 9 x 4
# Groups:   name [3]
  name   year   age real_age
  <chr> <dbl> <dbl>    <dbl>
1 john   2010    21        0
2 john   2011    21        0
3 john   2015    21        0
4 alex   2014    55        0
5 alex   2016    55        0
6 peter  2010    61        0
7 peter  2011    61        0
8 peter  2012    61        0
9 peter  2013    61        0

Can someone please show me how to fix this problem?

Thank you!

CodePudding user response:

Group by 'name', use complete to get the missing 'year', fill the NA elements in 'year', 'age' with previous non-NA values, subtract the 'age' from the sequence of group index

library(dplyr)
library(tidyr)
problem_data %>% 
    group_by(name) %>% 
    complete(year = full_seq(year, period = 1)) %>% 
    fill(year, age, .direction = "downup") %>%
    mutate(real_age= age - (row_number() - 1)) %>%
   ungroup

CodePudding user response:

Do you need this:

library(dplyr)
library(tidyr)
problem_data %>% 
  group_by(name) %>% 
  complete(., year = first(year):last(year), fill = list(freq = 0)) %>%
  arrange(-year, .by_group = TRUE) %>% 
  mutate(real_age = ifelse(row_number()>1,  last(age)-row_number() 1,  last(age))) %>% 
  drop_na() # optional

  name   year   age real_age
  <chr> <dbl> <dbl>    <dbl>
1 alex   2016    55       55
2 alex   2014    55       53
3 john   2015    21       21
4 john   2011    21       17
5 john   2010    21       16
6 peter  2013    61       61
7 peter  2012    61       60
8 peter  2011    61       59
9 peter  2010    61       58
  • Related