Home > Software engineering >  R script for extracting last row by ID where the next row value for that ID is not consecutive
R script for extracting last row by ID where the next row value for that ID is not consecutive

Time:11-19

I'm preparing data for a cox regression model and I have a dataset that shows all of the years that participants were registered as living in the province. There is a variable that identifies how many days they were registered as living in the province for each year. I want their start year to be their first year that they were fully registered (>=365 days) as living in the province. I also want the last year that they were fully registered as living in the province. However, there are some participants that left the province, then returned later for at least one full-time year. For this analysis, I want to consider participants follow-up to end when they leave the first time as we can't track their health outcomes that may have occurred while outside the province.

Imagine I have already sorted the dataset by ID, then year. I then removed any observations where there were less than 365 days registered.

Here is a test dataset:

df <- data.frame(
  ID = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3), 
  values = c(1996,1998,1999,2000,2001,2001,2002,2003,2004,2007,2008,2004,2005,2006,2007)
)

df_inc <- df %>% 
  group_by(ID) %>% 
  filter(row_number(values)==1)

This works as intended, returning the first fully registered year per participant

df_lastoverall <- df %>% 
  group_by(ID) %>% 
  filter(row_number(values)==n())

This works, but returns the last fully registered year, regardless of whether their years were all consecutive, or they left the province then returned to have at least one full year. This gives a last year of 2001 for ID1, 2008 for ID2, and 2007 for ID3.

Here's where I'm at and can use some help... I'm looking for some way to identify the last full year after a consecutive run from their start year (just incase there are people that left and returned more than once). This should return a last year of 1996 for ID1, 2004 for ID2, and 2007 for ID3.

Something like this, perhaps?

df_last <- df %>% 
  group_by(ID) %>% 
  filter(row_number(values)[cumsum(c(1, diff(values)!=1))])
# OR
df_last <- df %>% 
  group_by(ID) %>% 
  filter(row_number(values)==max(values[cumsum(c(1, diff(values)!=1))]))

CodePudding user response:

You can leverage data.table::rleid() as follows:

group_by(df,ID) %>%
  filter(data.table::rleid(c(1,diff(values)))==1)

Output:

     ID values
  <dbl>  <dbl>
1     1   1996
2     2   2001
3     2   2002
4     2   2003
5     2   2004
6     3   2004
7     3   2005
8     3   2006
9     3   2007

If you wanted only the last year of each group, you can add a second filter at the end:

group_by(df,ID) %>%
  filter(data.table::rleid(c(1,diff(values)))==1) %>% 
  filter(row_number()==n())

Output:

     ID values
  <dbl>  <dbl>
1     1   1996
2     2   2004
3     3   2007

CodePudding user response:

You could use a tidyverse approach:

library(dplyr)
library(tidyr)

df_first <- df %>% 
  group_by(ID) %>% 
  filter(cumsum(c(1,diff(values)) - 1) == 0) %>% 
  slice_min(values) %>% 
  ungroup()
  
df_last <- df %>% 
  group_by(ID) %>% 
  filter(cumsum(c(1,diff(values)) - 1) == 0) %>% 
  slice_max(values) %>% 
  ungroup()

This returns

#> df_first
# A tibble: 3 × 2
     ID values
  <dbl>  <dbl>
1     1   1996
2     2   2001
3     3   2004

and

#> df_last
# A tibble: 3 × 2
     ID values
  <dbl>  <dbl>
1     1   1996
2     2   2004
3     3   2007
  • Related