Home > database >  Ignore zeros and NAs in cumsum
Ignore zeros and NAs in cumsum

Time:02-19

I need to assign numbers to sets of consecutive values in every column and create new columns. Eventually I want to find a sum of values in z column that correspond to first consecutive numbers in each column.

My data looks something like this:

y1=c(1,2,3,8,9,0)
y2=c(0,0,0,4,5,6)
z=c(200,250,200,100,90,80)
yabc<-tibble(y1,y2,z)

# A tibble: 6 × 3
     y1    y2     z
  <dbl> <dbl> <dbl>
1     1     0   200
2     2     0   250
3     3     0   200
4     8     4   100
5     9     5    90
6     0     6    80

I tried the following formula

yabc %>%
  mutate_at(vars(starts_with("y")),
            list(mod = ~ cumsum(c(FALSE, diff(.x)!=1)) 1))

that gave me the following result:

# A tibble: 6 × 5
     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     1     0   200      1      1
2     2     0   250      1      2
3     3     0   200      1      3
4     8     4   100      2      4
5     9     5    90      2      4
6     0     6    80      3      4

I am only interested in numbers greater than zero. I tried replacing zeros with NA, but it did not work either.

# A tibble: 6 × 5
     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     1    NA   200      1      1
2     2    NA   250      1     NA
3     3    NA   200      1     NA
4     8     4   100      2     NA
5     9     5    90      2     NA
6    NA     6    80     NA     NA

What I would like the data to look like is:

# A tibble: 6 × 5

     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     1     0   200      1      NA
2     2     0   250      1      NA
3     3     0   200      1      NA
4     8     4   100      2      1
5     9     5    90      2      1
6     0     6    80      NA     1

Is there any way to exclude zeros and start applying the formula only when .x is greater than 0? Or any other way to make the formula work the way I need? Thank you!

CodePudding user response:

FYI: mutate_at has been superseded by across, I'll demonstrate the new method in my code.

yabc %>%
  mutate(
    across(starts_with("y"),
      list(mod = ~ if_else(.x > 0,
                           cumsum(.x > 0 & c(FALSE, diff(.x) != 1))   1L,
                           NA_integer_) )
    )
  )
# # A tibble: 6 x 5
#      y1    y2     z y1_mod y2_mod
#   <dbl> <dbl> <dbl>  <int>  <int>
# 1     1     0   200      1     NA
# 2     2     0   250      1     NA
# 3     3     0   200      1     NA
# 4     8     4   100      2      2
# 5     9     5    90      2      2
# 6     0     6    80     NA      2

If this is sufficient (you don't care if it's 1 or 2 for the first effective group in y2_mod), then you're good. If you want to reduce them all to be 1-based, then

yabc %>%
  mutate(
    across(starts_with("y"),
      list(mod = ~ if_else(.x > 0,
                           cumsum(.x > 0 & c(FALSE, diff(.x) != 1)),
                           NA_integer_))),
    across(ends_with("_mod"),
      ~ if_else(is.na(.x), .x, match(.x, na.omit(unique(.x))))
  )
)
# # A tibble: 6 x 5
#      y1    y2     z y1_mod y2_mod
#   <dbl> <dbl> <dbl>  <int>  <int>
# 1     1     0   200      1     NA
# 2     2     0   250      1     NA
# 3     3     0   200      1     NA
# 4     8     4   100      2      1
# 5     9     5    90      2      1
# 6     0     6    80     NA      1

Notes:

  • if_else is helpful to handle the NA-including rows specially; it requires the same class, which can be annoying/confusing. Because of this, we need to pass the specific "class" of NA as the false= (third) argument to if_else. For example, cumsum(.) 1 produces a numeric, so the third arg would need to be NA_real_ (since the default NA is actually logical). Another way to deal with it is to either use cumsum(.) 1L (produces an integer) and NA_integer_ or (as I show in my second example) use cumsum(.) by itself (and NA_integer_) since we match things later (and match(.) returns integer)
  • I demo the shift from your mutate_at to mutate(across(..)). An important change here from mutate is that we run across without assigning its return to anything. In essence, it returns a named-list where each element of the list is an updated column or a new one, depending on the presence of .names; that takes a glue-like string to allow for renaming the calculated columns, thereby adding new columns instead of the default action (no .names) of overwriting the columns in-place. The alternate way of producing new (not in-place) columns is the way you used, with a named list of functions, still a common/supported way to use a list of functions within across(..).

CodePudding user response:

library(data.table)
library(tidyverse)

yabc %>%
  mutate(across(starts_with('y'),
           ~ as.integer(factor(`is.na<-`(rleid(.x - row_number()), !.x))), 
           .names = '{col}_mod'))
    
# A tibble: 6 x 5
     y1    y2     z y1_mod y2_mod
  <dbl> <dbl> <dbl>  <int>  <int>
1     1     0   200      1     NA
2     2     0   250      1     NA
3     3     0   200      1     NA
4     8     4   100      2      1
5     9     5    90      2      1
6     0     6    80     NA      1

The trick lies in knowing that for consecutive numbers, the difference between the number and their row_number() is the same:

ie consider:

 x <- c(1,2,3,6,7,8,10,11,12)

The consecutive numbers can be grouped as:

 x - seq_along(x)
 [1] 0 0 0 2 2 2 3 3 3

As you can see, the consecutive numbers are grouped together. To get the desired groups, we should use rle

rleid(x-seq_along(x))
[1] 1 1 1 2 2 2 3 3 3
  •  Tags:  
  • r
  • Related