I have this dataset on an R/SQL Server:
name year
1 john 2010
2 john 2011
3 john 2013
4 jack 2015
5 jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
I am trying to add two columns that:
- Column 1: Looks at the "number of missing years between successive rows" for each person.
- Column 2: Sum the cumulative "number of missing years" for each person
For example - the first instance of each person will be 0, and then:
# note: in this specific example that I have created, "missing_ years" is the same as the "cumulative_missing_years"
name year missing_years cumulative_missing_years
1 john 2010 0 0
2 john 2011 0 0
3 john 2013 1 1
4 jack 2015 0 0
5 jack 2018 3 3
6 henry 2010 0 0
7 henry 2011 0 0
8 henry 2012 0 0
I think this can be done with a "grouped cumulative difference" and "grouped cumulative sums":
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
final = my_data %>%
group_by(name) %>%
arrange(year) %>%
mutate(missing_year) = year- lag(year, default = first(year)) %>%
mutate(cumulative_missing_years) = mutate( cumulative_missing_years = cumsum(cs))
But I am not sure if I am doing this correctly.
Ideally, I am looking for an SQL approach or an R approach (e.g. via DBPLYR) that can be used to interact with the dataset.
Can someone please suggest an approach for doing this?
Thank you!
CodePudding user response:
Using the data in the Note at the end perform a left self join to get the next year of the same name and then subtract and take the cumulative sum.
library(sqldf)
sqldf("select a.*,
coalesce(min(b.year) - a.year - 1, 0) as missing,
sum(coalesce(min(b.year) - a.year - 1, 0)) over
(partition by a.name order by a.year) as sum
from DF a
left join DF b on a.name = b.name and a.year < b.year
group by a.name, a.year
order by a.name, a.year")
giving:
name year missing sum
1 henry 2010 0 0
2 henry 2011 0 0
3 henry 2012 0 0
4 jack 2015 2 2
5 jack 2018 0 2
6 john 2010 0 0
7 john 2011 1 1
8 john 2013 0 1
Note
Lines <- "name year
1 john 2010
2 john 2011
3 john 2013
4 jack 2015
5 jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
"
DF <- read.table(text = Lines)
CodePudding user response:
I hope this helps
name <- c(rep("John", 3), rep("jack", 2), rep("henry", 3) )
year <- c(2010, 2011, 2013, 2015, 2018, 2010, 2011, 2012)
dt <- data.frame(name = name, year = year)
# first group the data by name then order by year then mutate
dt <- dt %>%
group_by(name) %>%
arrange(year, .by_group = TRUE) %>%
mutate( mis_yr = if_else(is.na(year - lag(year, n = 1L) -1), 0,
year - lag(year, n = 1L) -1) ,
cum_yr = cumsum(mis_yr)
) %>%
ungroup()
Hare is the outcome
name year mis_yr cum_yr
<chr> <dbl> <dbl> <dbl>
1 henry 2010 0 0
2 henry 2011 0 0
3 henry 2012 0 0
4 jack 2015 0 0
5 jack 2018 2 2
6 John 2010 0 0
7 John 2011 0 0
8 John 2013 1 1