mutate a variable according to series of values (first and last row. of occurence)-CodePudding

I have a data frame tel2 with two variable: nuc and occ.

I would like to create another variable "start" that identify series of same occ value and equal to the nuc value of the first occurence of the occ value (in the serie).

I would also like to create the variable "stop" that identify series of same occ value and equal to the nuc value of the last occurence of the occ value (in the series).

See below for representative example of data and desired output. In the example, the first series is the three first 0. So, for the first three rows I want 100 for start and 102 for stop. If there is two or more series of a same value, I would like to treat them separately as for series of 4 (see example below).

I wanted to group_by occ but it won't work for multiple series of the same occ value.

Any Idea to help me ?

Ideally, I would like a code runing fast as my real data frame has 10 000 000 rows.

Thank you for your time

# data

tel=data.frame(aa=c(153,113,163,117,193,162,110,109,186,103),
               bb=c(189,176,185,130,200,189,156,123,198,189),
               ID=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"))

nuc2=100:200

library(data.table)

tel2=as.data.frame(setDT(tel)[SJ(v=nuc2), on=.(aa<=v, bb>=v)][,.(occ = sum(!is.na(ID))), by=.(nuc=aa)])

> tel2
    nuc occ
1   100   0
2   101   0
3   102   0
4   103   1
5   104   1
6   105   1
7   106   1
8   107   1
9   108   1
10  109   2
11  110   3
12  111   3
13  112   3
14  113   4
15  114   4
16  115   4
17  116   4
18  117   5
19  118   5
20  119   5
21  120   5
22  121   5
23  122   5
24  123   5
25  124   4
26  125   4
27  126   4
28  127   4
29  128   4
30  129   4
31  130   4
32  131   3
33  132   3
34  133   3
35  134   3
36  135   3
37  136   3
38  137   3
39  138   3
40  139   3
41  140   3
42  141   3
43  142   3
44  143   3
45  144   3
46  145   3
47  146   3
48  147   3
49  148   3
50  149   3
51  150   3
52  151   3
53  152   3
54  153   4
55  154   4
56  155   4
57  156   4
58  157   3
59  158   3
60  159   3
61  160   3
62  161   3
63  162   4
64  163   5
65  164   5
66  165   5
67  166   5
68  167   5
69  168   5
70  169   5
71  170   5
72  171   5
73  172   5
74  173   5
75  174   5
76  175   5
77  176   5
78  177   4
79  178   4
80  179   4
81  180   4
82  181   4
83  182   4
84  183   4
85  184   4
86  185   4
87  186   4
88  187   4
89  188   4
90  189   4
91  190   1
92  191   1
93  192   1
94  193   2
95  194   2
96  195   2
97  196   2
98  197   2
99  198   2
100 199   1
101 200   1

# output (first 29 rows)
> tel_output
   nuc occ start stop
1  100   0   100  102
2  101   0   100  102
3  102   0   100  102
4  103   1   103  108
5  104   1   103  108
6  105   1   103  108
7  106   1   103  108
8  107   1   103  108
9  108   1   103  108
10 109   2   109  109
11 110   3   110  112
12 111   3   110  112
13 112   3   110  112
14 113   4   103  116
15 114   4   103  116
16 115   4   103  116
17 116   4   103  116
18 117   5   117  123
19 118   5   117  123
20 119   5   117  123
21 120   5   117  123
22 121   5   117  123
23 122   5   117  123
24 123   5   117  123
25 124   4   124  128
26 125   4   124  128
27 126   4   124  128
28 127   4   124  128
29 128   4   124  128

CodePudding user response：

We may use rleid on 'occ' to create a grouping column before we create the first and last from the 'nuc' column for 'start', 'stop' columns

library(dplyr)
library(data.table)
tel2 %>% 
   group_by(grp = rleid(occ)) %>%
   mutate(start = first(nuc), stop = last(nuc)) %>%
   ungroup %>%
   select(-grp)

-output

# A tibble: 101 × 4
     nuc   occ start  stop
   <int> <int> <int> <int>
 1   100     0   100   102
 2   101     0   100   102
 3   102     0   100   102
 4   103     1   103   108
 5   104     1   103   108
 6   105     1   103   108
 7   106     1   103   108
 8   107     1   103   108
 9   108     1   103   108
10   109     2   109   109
# … with 91 more rows

CodePudding user response：

You can group_by unique occurrences of occ by calculating the cumsum of it if it is different from the lead value (to account for repeating occ values).

library(tidyverse)

tel2 %>% 
  group_by(tmp = c(0, na.omit(cumsum(lead(occ) != occ)))) %>% 
  mutate(start = first(nuc),
         end = last(nuc)) %>% 
  ungroup() %>% 
  select(-tmp)

# A tibble: 101 × 4
     nuc   occ start   end
   <int> <int> <int> <int>
 1   100     0   100   102
 2   101     0   100   102
 3   102     0   100   102
 4   103     1   103   108
 5   104     1   103   108
 6   105     1   103   108
 7   106     1   103   108
 8   107     1   103   108
 9   108     1   103   108
10   109     2   109   109
11   110     3   110   112
12   111     3   110   112
13   112     3   110   112
14   113     4   113   116
15   114     4   113   116
16   115     4   113   116
17   116     4   113   116
18   117     5   117   123
19   118     5   117   123
20   119     5   117   123