I have a data frame tel2 with two variable: nuc and occ.
I would like to create another variable "start" that identify series of same occ value and equal to the nuc value of the first occurence of the occ value (in the serie).
I would also like to create the variable "stop" that identify series of same occ value and equal to the nuc value of the last occurence of the occ value (in the series).
See below for representative example of data and desired output. In the example, the first series is the three first 0. So, for the first three rows I want 100 for start and 102 for stop. If there is two or more series of a same value, I would like to treat them separately as for series of 4 (see example below).
I wanted to group_by
occ but it won't work for multiple series of the same occ value.
Any Idea to help me ?
Ideally, I would like a code runing fast as my real data frame has 10 000 000 rows.
Thank you for your time
# data
tel=data.frame(aa=c(153,113,163,117,193,162,110,109,186,103),
bb=c(189,176,185,130,200,189,156,123,198,189),
ID=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"))
nuc2=100:200
library(data.table)
tel2=as.data.frame(setDT(tel)[SJ(v=nuc2), on=.(aa<=v, bb>=v)][,.(occ = sum(!is.na(ID))), by=.(nuc=aa)])
> tel2
nuc occ
1 100 0
2 101 0
3 102 0
4 103 1
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 2
11 110 3
12 111 3
13 112 3
14 113 4
15 114 4
16 115 4
17 116 4
18 117 5
19 118 5
20 119 5
21 120 5
22 121 5
23 122 5
24 123 5
25 124 4
26 125 4
27 126 4
28 127 4
29 128 4
30 129 4
31 130 4
32 131 3
33 132 3
34 133 3
35 134 3
36 135 3
37 136 3
38 137 3
39 138 3
40 139 3
41 140 3
42 141 3
43 142 3
44 143 3
45 144 3
46 145 3
47 146 3
48 147 3
49 148 3
50 149 3
51 150 3
52 151 3
53 152 3
54 153 4
55 154 4
56 155 4
57 156 4
58 157 3
59 158 3
60 159 3
61 160 3
62 161 3
63 162 4
64 163 5
65 164 5
66 165 5
67 166 5
68 167 5
69 168 5
70 169 5
71 170 5
72 171 5
73 172 5
74 173 5
75 174 5
76 175 5
77 176 5
78 177 4
79 178 4
80 179 4
81 180 4
82 181 4
83 182 4
84 183 4
85 184 4
86 185 4
87 186 4
88 187 4
89 188 4
90 189 4
91 190 1
92 191 1
93 192 1
94 193 2
95 194 2
96 195 2
97 196 2
98 197 2
99 198 2
100 199 1
101 200 1
# output (first 29 rows)
> tel_output
nuc occ start stop
1 100 0 100 102
2 101 0 100 102
3 102 0 100 102
4 103 1 103 108
5 104 1 103 108
6 105 1 103 108
7 106 1 103 108
8 107 1 103 108
9 108 1 103 108
10 109 2 109 109
11 110 3 110 112
12 111 3 110 112
13 112 3 110 112
14 113 4 103 116
15 114 4 103 116
16 115 4 103 116
17 116 4 103 116
18 117 5 117 123
19 118 5 117 123
20 119 5 117 123
21 120 5 117 123
22 121 5 117 123
23 122 5 117 123
24 123 5 117 123
25 124 4 124 128
26 125 4 124 128
27 126 4 124 128
28 127 4 124 128
29 128 4 124 128
CodePudding user response:
We may use rleid
on 'occ' to create a grouping column before we create the first
and last
from the 'nuc' column for 'start', 'stop' columns
library(dplyr)
library(data.table)
tel2 %>%
group_by(grp = rleid(occ)) %>%
mutate(start = first(nuc), stop = last(nuc)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 101 × 4
nuc occ start stop
<int> <int> <int> <int>
1 100 0 100 102
2 101 0 100 102
3 102 0 100 102
4 103 1 103 108
5 104 1 103 108
6 105 1 103 108
7 106 1 103 108
8 107 1 103 108
9 108 1 103 108
10 109 2 109 109
# … with 91 more rows
CodePudding user response:
You can group_by
unique occurrences of occ
by calculating the cumsum
of it if it is different from the lead
value (to account for repeating occ
values).
library(tidyverse)
tel2 %>%
group_by(tmp = c(0, na.omit(cumsum(lead(occ) != occ)))) %>%
mutate(start = first(nuc),
end = last(nuc)) %>%
ungroup() %>%
select(-tmp)
# A tibble: 101 × 4
nuc occ start end
<int> <int> <int> <int>
1 100 0 100 102
2 101 0 100 102
3 102 0 100 102
4 103 1 103 108
5 104 1 103 108
6 105 1 103 108
7 106 1 103 108
8 107 1 103 108
9 108 1 103 108
10 109 2 109 109
11 110 3 110 112
12 111 3 110 112
13 112 3 110 112
14 113 4 113 116
15 114 4 113 116
16 115 4 113 116
17 116 4 113 116
18 117 5 117 123
19 118 5 117 123
20 119 5 117 123