I am working with the R programming language.
I have the following dataset:
library(dplyr)
my_data = data.frame(id = c(1,1,1,1,1,1, 2,2,2) , year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020), var = c(1,7,3,9,5,6, 88, 12, 5))
> my_data
  id year var
1  1 2010   1
2  1 2011   7
3  1 2012   3
4  1 2013   9
5  1 2015   5
6  1 2016   6
7  2 2015  88
8  2 2016  12
9  2 2020   5
My Question: For each ID - I want to find out when the first "non-consecutive" year occurs, and then delete all remaining rows.
For example:
- When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
- When ID = 2, the first "jump" occurs at 2016 - therefore, I would like to delete all rows after 2016.
This was my attempt to write the code for this problem:
final = my_data %>%
group_by(id) %>%
mutate(break_index = which(diff(year) > 1)[1]) %>%
group_by(id, add = TRUE) %>%
slice(1:break_index)
The code appears to be working - but I get the following warning messages which are concerning me:
Warning messages:
1: In 1:break_index :
numerical expression has 6 elements: only the first used
2: In 1:break_index :
numerical expression has 3 elements: only the first used
Can someone please tell me if I have done this correctly?
Thanks!
CodePudding user response:
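You get the warning because break_index holds more than one value per group: mutate() repeats the same value on every row of the group, and the : operator uses only the first of them. Since all the values within a group are identical, your attempt still returns the correct result. To avoid the warning, pass a single value explicitly, with either slice(1:break_index[1]) or slice(1:first(break_index)).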
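To see why the warning appears and why taking a single element removes it, here is a small base-R sketch for id = 1. The rep() call only imitates how mutate() repeats a length-one result across the rows of a group; it is an illustration, not what dplyr runs internally.

```r
# Years for id = 1, copied from the question.
year <- c(2010, 2011, 2012, 2013, 2015, 2016)

# Position of the last row before the first gap (here 4, i.e. 2013).
first_gap <- which(diff(year) > 1)[1]

# mutate() stores that single value once per row, so the column has 6 elements.
break_index <- rep(first_gap, length(year))

# 1:break_index would warn, because `:` only uses the first element.
# Selecting one element explicitly gives the same rows without the warning.
rows <- 1:break_index[1]
```

Here rows contains the row positions 1 to 4, which is what slice() receives for id = 1.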
Here is another way to handle this.
library(dplyr)
my_data %>%
group_by(id) %>%
filter(row_number() <= which(diff(year) > 1)[1])
#     id  year   var
#  <dbl> <dbl> <dbl>
#1     1  2010     1
#2     1  2011     7
#3     1  2012     3
#4     1  2013     9
#5     2  2015    88
#6     2  2016    12
With dplyr 1.1.0, we can use temporary grouping with .by:
my_data %>%
filter(row_number() <= which(diff(year) > 1)[1], .by = id)
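One caveat with the approaches above: which(diff(year) > 1)[1] is NA for a group that has no gap at all, and filter() drops rows whose condition is NA, so such a group disappears entirely. The sketch below shows this and one possible alternative based on cumsum(). The rows with id = 3 are hypothetical, added only to illustrate a gap-free group; they are not part of the original data.

```r
library(dplyr)

# Data from the question, plus a hypothetical id = 3 with no gap in its years.
my_data <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020, 2018, 2019),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5, 4, 2)
)

# For id = 3 the index is NA, so every comparison is NA and the group is dropped.
dropped <- my_data %>%
  group_by(id) %>%
  filter(row_number() <= which(diff(year) > 1)[1]) %>%
  ungroup()

# Keep rows for as long as no gap has been seen; a gap-free group is kept in full.
kept <- my_data %>%
  group_by(id) %>%
  filter(cumsum(c(1, diff(year)) > 1) == 0) %>%
  ungroup()
```

Whether a gap-free group should be kept or dropped depends on what you need; if every id in your data is guaranteed to have a gap, both versions give the same result.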