I need to assign numbers to sets of consecutive values in every column and create new columns. Eventually I want to find a sum of values in z column that correspond to first consecutive numbers in each column.
My data looks something like this:
y1=c(1,2,3,8,9,0)
y2=c(0,0,0,4,5,6)
z=c(200,250,200,100,90,80)
yabc<-tibble(y1,y2,z)
# A tibble: 6 × 3
y1 y2 z
<dbl> <dbl> <dbl>
1 1 0 200
2 2 0 250
3 3 0 200
4 8 4 100
5 9 5 90
6 0 6 80
I tried the following formula
yabc %>%
mutate_at(vars(starts_with("y")),
list(mod = ~ cumsum(c(FALSE, diff(.x)!=1)) 1))
that gave me the following result:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 1
2 2 0 250 1 2
3 3 0 200 1 3
4 8 4 100 2 4
5 9 5 90 2 4
6 0 6 80 3 4
I am only interested in numbers greater than zero. I tried replacing zeros with NA
, but it did not work either.
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 200 1 1
2 2 NA 250 1 NA
3 3 NA 200 1 NA
4 8 4 100 2 NA
5 9 5 90 2 NA
6 NA 6 80 NA NA
What I would like the data to look like is:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
Is there any way to exclude zeros and start applying the formula only when .x
is greater than 0? Or any other way to make the formula work the way I need? Thank you!
CodePudding user response:
FYI: mutate_at
has been superseded by across
, I'll demonstrate the new method in my code.
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)) 1L,
NA_integer_) )
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 2
# 5 9 5 90 2 2
# 6 0 6 80 NA 2
If this is sufficient (you don't care if it's 1
or 2
for the first effective group in y2_mod
), then you're good. If you want to reduce them all to be 1-based, then
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)),
NA_integer_))),
across(ends_with("_mod"),
~ if_else(is.na(.x), .x, match(.x, na.omit(unique(.x))))
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 1
# 5 9 5 90 2 1
# 6 0 6 80 NA 1
Notes:
if_else
is helpful to handle theNA
-including rows specially; it requires the same class, which can be annoying/confusing. Because of this, we need to pass the specific "class" ofNA
as thefalse=
(third) argument toif_else
. For example,cumsum(.) 1
produces anumeric
, so the third arg would need to beNA_real_
(since the defaultNA
is actuallylogical
). Another way to deal with it is to either usecumsum(.) 1L
(produces an integer) andNA_integer_
or (as I show in my second example) usecumsum(.)
by itself (andNA_integer_
) since wematch
things later (andmatch(.)
returnsinteger
)- I demo the shift from your
mutate_at
tomutate(across(..))
. An important change here frommutate
is that we runacross
without assigning its return to anything. In essence, it returns a named-list where each element of the list is an updated column or a new one, depending on the presence of.names
; that takes aglue
-like string to allow for renaming the calculated columns, thereby adding new columns instead of the default action (no.names
) of overwriting the columns in-place. The alternate way of producing new (not in-place) columns is the way you used, with a named list of functions, still a common/supported way to use a list of functions withinacross(..)
.
CodePudding user response:
library(data.table)
library(tidyverse)
yabc %>%
mutate(across(starts_with('y'),
~ as.integer(factor(`is.na<-`(rleid(.x - row_number()), !.x))),
.names = '{col}_mod'))
# A tibble: 6 x 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <int> <int>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
The trick lies in knowing that for consecutive numbers, the difference between the number and their row_number() is the same:
ie consider:
x <- c(1,2,3,6,7,8,10,11,12)
The consecutive numbers can be grouped as:
x - seq_along(x)
[1] 0 0 0 2 2 2 3 3 3
As you can see, the consecutive numbers are grouped together. To get the desired groups, we should use rle
rleid(x-seq_along(x))
[1] 1 1 1 2 2 2 3 3 3