I have a dataset that looks like this in R:
name = c("john", "john", "john", "alex", "alex", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2015, 2014, 2016, 2010, 2011, 2012, 2013)
age = c(21, 21, 21, 55, 55, 61, 61, 61, 61)
problem_data = data.frame(name, year, age)
name year age
1 john 2010 21
2 john 2011 21
3 john 2015 21
4 alex 2014 55
5 alex 2016 55
6 peter 2010 61
7 peter 2011 61
8 peter 2012 61
9 peter 2013 61
In this dataset, the age of each person at the last recorded year has been erroneously inserted at each row. For example - in reality:
- Peter was 61 in 2013
- Peter was 60 in 2012
- Peter was 59 in 2011
- Peter was 58 in 2010
Sometimes years are missing - as a result:
- Alex was 55 in 2016
- Alex was 53 in 2014
I am trying to research a way in R that can handle such a task. I have been trying to combine "cumulative group differences" and "max row conditions" - but I am not sure how these concepts can be combined together to achieve this:
# https://stackoverflow.com/questions/39237345/subtract-value-in-previous-row-for-each-section-of-a-data-frame-that-corresponds
library(dplyr)
new_data = problem_data %>%
group_by(name) %>%
mutate(real_age= age - lag(age, default = age[1]))
Bur this approach has made everyone's age as 0!
# A tibble: 9 x 4
# Groups: name [3]
name year age real_age
<chr> <dbl> <dbl> <dbl>
1 john 2010 21 0
2 john 2011 21 0
3 john 2015 21 0
4 alex 2014 55 0
5 alex 2016 55 0
6 peter 2010 61 0
7 peter 2011 61 0
8 peter 2012 61 0
9 peter 2013 61 0
Can someone please show me how to fix this problem?
Thank you!
CodePudding user response:
Group by 'name', use complete
to get the missing 'year', fill
the NA
elements in 'year', 'age' with previous non-NA values, subtract the 'age' from the sequence of group index
library(dplyr)
library(tidyr)
problem_data %>%
group_by(name) %>%
complete(year = full_seq(year, period = 1)) %>%
fill(year, age, .direction = "downup") %>%
mutate(real_age= age - (row_number() - 1)) %>%
ungroup
CodePudding user response:
Do you need this:
library(dplyr)
library(tidyr)
problem_data %>%
group_by(name) %>%
complete(., year = first(year):last(year), fill = list(freq = 0)) %>%
arrange(-year, .by_group = TRUE) %>%
mutate(real_age = ifelse(row_number()>1, last(age)-row_number() 1, last(age))) %>%
drop_na() # optional
name year age real_age
<chr> <dbl> <dbl> <dbl>
1 alex 2016 55 55
2 alex 2014 55 53
3 john 2015 21 21
4 john 2011 21 17
5 john 2010 21 16
6 peter 2013 61 61
7 peter 2012 61 60
8 peter 2011 61 59
9 peter 2010 61 58