R: Delete Rows After First "Break" Occurs

Time:02-03

I am working with the R programming language.

I have the following dataset:

library(dplyr)

my_data = data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5)
)

> my_data
  id year var
1  1 2010   1
2  1 2011   7
3  1 2012   3
4  1 2013   9
5  1 2015   5
6  1 2016   6
7  2 2015  88
8  2 2016  12
9  2 2020   5

My Question: For each ID, I want to find where the first "non-consecutive" year occurs, and then delete all rows after that point.

For example:

  • When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
  • When ID = 2, the first "jump" occurs at 2016 - therefore, I would like to delete all rows after 2016.

This was my attempt to write the code for this problem:

final = my_data %>%
  group_by(id) %>%
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  group_by(id, add = TRUE) %>%
  slice(1:break_index)

The code appears to be working - but I get the following warning messages which are concerning me:

Warning messages:
1: In 1:break_index :
  numerical expression has 6 elements: only the first used
2: In 1:break_index :
  numerical expression has 3 elements: only the first used

Can someone please tell me if I have done this correctly?

Thanks!

CodePudding user response:

You get the warning because break_index is a column, so within each group it holds one value per row (6 values for id 1, 3 for id 2). Those values are identical within a group and 1:break_index uses only the first one, so your attempt still gives the correct result. To avoid the warning, pass a single value explicitly: try slice(1:break_index[1]) or slice(1:first(break_index)).
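As a rough sketch of that fix applied to the code from the question (same data and column names; the redundant second group_by() is left out and an ungroup() is added at the end, both of which are my own choices rather than part of the original attempt):

```r
library(dplyr)

my_data <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5)
)

final <- my_data %>%
  group_by(id) %>%
  # Position of the first gap within each group, repeated on every row
  mutate(break_index = which(diff(year) > 1)[1]) %>%
  # first() reduces the repeated column to a single value, so 1:... gets a scalar
  slice(1:first(break_index)) %>%
  ungroup()
```

This should keep the same rows as the original attempt, just without the "only the first used" warnings.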

Here is another way to handle this.

library(dplyr)

my_data %>%
  group_by(id) %>%
  filter(row_number() <= which(diff(year) > 1)[1])

#     id  year   var
#  <dbl> <dbl> <dbl>
#1     1  2010     1
#2     1  2011     7
#3     1  2012     3
#4     1  2013     9
#5     2  2015    88
#6     2  2016    12

With dplyr 1.1.0 or later, we can use per-operation grouping with .by:

my_data %>%
  filter(row_number() <= which(diff(year) > 1)[1], .by = id)
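
One case the sample data does not cover: if an id has no gap at all (or only one row), which(diff(year) > 1)[1] is NA, and filter() then drops every row of that group because the comparison is NA. If such groups can occur and should be kept whole, a possible variant is to count the gaps seen so far with cumsum() and keep rows while that count is still zero. The gap_data frame below is made up for illustration and is not from the question:

```r
library(dplyr)

# Hypothetical data: id 2 has a single row and id 3 has no gap in its years
gap_data <- data.frame(
  id   = c(1, 1, 1, 2, 3, 3, 3),
  year = c(2010, 2011, 2013, 2015, 2018, 2019, 2020)
)

kept <- gap_data %>%
  group_by(id) %>%
  # c(FALSE, ...) marks the first row as "no gap before it";
  # cumsum() becomes >= 1 from the first row that follows a gap
  filter(cumsum(c(FALSE, diff(year) > 1)) == 0) %>%
  ungroup()
```

Here id 1 is cut after 2011, while ids 2 and 3 are returned unchanged. On the data from the question this should give the same six rows as the answers above.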