I have a vector that contains a series of text and numbers, like:
t <- c("A", 1:3, "A", 1:4, "A", 1:3)
t
#> [1] "A" "1" "2" "3" "A" "1" "2" "3" "4" "A" "1" "2" "3"
Created on 2022-08-06 by the reprex package (v2.0.1)
The actual data is taken from a PDF, with the data frame collapsed into a single column vector, and the wrap length is uneven for some reason (probably because of merged cells). To process this data efficiently, I want to know the length from each "A" to the next "A" or to the end. In this example the answer would be 3, 4, 3 (Edit: sorry for a simple mistake, it would be 4, 5, 4). I have tried many different methods but can't find one that works. Does anyone know of a better way?
CodePudding user response:
An alternative using rle (run-length encoding):
with(rle(t == "A"), subset(lengths, !values))
#> [1] 3 4 3
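To see why this works, it may help to look at the intermediate result. The sketch below rebuilds the vector from the question and indexes the `rle()` output directly instead of going through `with()` and `subset()`; the names `r` and `res` are only illustrative.

```r
t <- c("A", 1:3, "A", 1:4, "A", 1:3)

# Run-length encode the logical vector: TRUE marks an "A"
r <- rle(t == "A")
r$lengths  # 1 3 1 4 1 3
r$values   # TRUE FALSE TRUE FALSE TRUE FALSE

# Keep only the lengths of the non-"A" runs
res <- r$lengths[!r$values]
res        # 3 4 3
```

Note that adjacent "A"s are merged into a single run, so an empty group between two "A"s does not show up as a 0 with this approach.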
CodePudding user response:
You want the number of elements
- (1) between adjacent "A"s;
- (2) from the last "A" (excluding it) to the end.
We can use either of the following:
diff(c(which(t == "A"), length(t) + 1)) - 1
#[1] 3 4 3
diff(which(c(t, "A") == "A")) - 1
#[1] 3 4 3
Essentially we pad an "A" at the end to turn (2) into (1). If the last element of t happens to be an "A", the last value in the result will be 0.
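The edit in the question asks for 4, 5, 4, which is the count with each "A" included. As a minimal sketch, dropping the `- 1` gives those inclusive sizes. The second variant, grouping with `cumsum()` and `split()`, is not part of the answer above; it is a different technique added for comparison, and it assumes the vector starts with an "A". The names `incl`, `grp` and `sizes` are illustrative.

```r
t <- c("A", 1:3, "A", 1:4, "A", 1:3)

# Inclusive sizes: each "A" plus the elements that follow it
incl <- diff(c(which(t == "A"), length(t) + 1))
incl   # 4 5 4

# Alternative: label each element with the index of its preceding "A",
# then count the elements per group
grp <- cumsum(t == "A")
sizes <- unname(lengths(split(t, grp)))
sizes  # 4 5 4
```

The `split(t, grp)` list itself may be the more useful result if the goal is to rebuild the records from the PDF, since each list element holds one "A" and its values.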
Extension:
If you further want to know the number of elements from the beginning to the first "A" (excluding it), we can pad a leading "A":
diff(c(0, which(t == "A"), length(t) + 1)) - 1
#[1] 0 3 4 3
diff(which(c("A", t, "A") == "A")) - 1
#[1] 0 3 4 3
Here, the first value is 0, because the first element of t happens to be an "A".
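For contrast, here is a small sketch with a made-up vector, `t2`, that does not start with "A" and does end with one. It is not from the question and is only meant to show a non-zero leading count and a zero trailing count.

```r
# Hypothetical input: two elements before the first "A", and a trailing "A"
t2 <- c("x", "y", "A", 1:2, "A")

# Pad with "A" on both sides and measure the gaps
by_value <- diff(which(c("A", t2, "A") == "A")) - 1
by_value  # 2 2 0

# Same result by padding the index vector instead
by_index <- diff(c(0, which(t2 == "A"), length(t2) + 1)) - 1
by_index  # 2 2 0
```

The first value counts the elements before the first "A", and the last value is 0 because nothing follows the final "A".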