Home > Enterprise >  Cumulative Differences Per Group in R/SQL
Cumulative Differences Per Group in R/SQL

Time:10-27

I have this dataset on an R/SQL Server:

   name year
1  john 2010
2  john 2011
3  john 2013
4  jack 2015
5  jack 2018
6 henry 2010
7 henry 2011
8 henry 2012

I am trying to add two columns that:

  • Column 1: Looks at the "number of missing years between successive rows" for each person.
  • Column 2: Sum the cumulative "number of missing years" for each person

For example - the first instance of each person will be 0, and then:

# note: in this specific example that I have created, "missing_ years" is the same as the "cumulative_missing_years"

   name year missing_years cumulative_missing_years
1  john 2010             0                        0
2  john 2011             0                        0
3  john 2013             1                        1
4  jack 2015             0                        0
5  jack 2018             3                        3
6 henry 2010             0                        0
7 henry 2011             0                        0
8 henry 2012             0                        0

I think this can be done with a "grouped cumulative difference" and "grouped cumulative sums":

 library(dplyr)
 library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
final = my_data %>%
    group_by(name) %>%
    arrange(year) %>%
    mutate(missing_year) = year- lag(year, default = first(year))  %>%
 mutate(cumulative_missing_years) = mutate( cumulative_missing_years = cumsum(cs))

But I am not sure if I am doing this correctly.

Ideally, I am looking for an SQL approach or an R approach (e.g. via DBPLYR) that can be used to interact with the dataset.

Can someone please suggest an approach for doing this?

Thank you!

CodePudding user response:

Using the data in the Note at the end perform a left self join to get the next year of the same name and then subtract and take the cumulative sum.

library(sqldf)

sqldf("select a.*, 
    coalesce(min(b.year) - a.year - 1, 0) as missing,
    sum(coalesce(min(b.year) - a.year - 1, 0)) over 
      (partition by a.name order by a.year) as sum
  from DF a 
  left join DF b on a.name = b.name and a.year < b.year
  group by a.name, a.year
  order by a.name, a.year")

giving:

   name year missing sum
1 henry 2010       0   0
2 henry 2011       0   0
3 henry 2012       0   0
4  jack 2015       2   2
5  jack 2018       0   2
6  john 2010       0   0
7  john 2011       1   1
8  john 2013       0   1

Note

Lines <- "name year
1  john 2010
2  john 2011
3  john 2013
4  jack 2015
5  jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
"
DF <- read.table(text = Lines)

CodePudding user response:

I hope this helps

name <- c(rep("John", 3), rep("jack", 2),  rep("henry", 3) )

year <- c(2010, 2011, 2013, 2015, 2018, 2010, 2011, 2012)


dt <- data.frame(name = name, year = year)

# first group the data by name then order by year then mutate
dt <- dt %>% 
  group_by(name) %>% 
  arrange(year, .by_group = TRUE) %>% 
  mutate( mis_yr = if_else(is.na(year - lag(year, n = 1L) -1), 0, 
                           year - lag(year, n = 1L) -1)   ,
          cum_yr = cumsum(mis_yr)
          ) %>% 
  ungroup()

Hare is the outcome

name   year mis_yr cum_yr
  <chr> <dbl>  <dbl>  <dbl>
1 henry  2010      0      0
2 henry  2011      0      0
3 henry  2012      0      0
4 jack   2015      0      0
5 jack   2018      2      2
6 John   2010      0      0
7 John   2011      0      0
8 John   2013      1      1
  • Related