Home > Back-end >  case_when fails when condition checks for rows that don't exist
case_when fails when condition checks for rows that don't exist

Time:05-07

Consider this data:

df <- data.frame(group  = c(1, 2, 2, 2),
                 start  = c(2, 7, 7, 7),
                 stop   = c(8, 7, 8, 9),
                 unstop = c(10, 7, 9, 10))

I now want to setup a more or less simple case_when per group in the form of "if first row do this, if second row do that". However, I'm getting an error. I assume it's because group 1 does only have one row so the condition can't be checked:

df |>
  group_by(group) |> 
  mutate(n_rows = n(),
         split_weeks = case_when(n_rows == 1 ~ str_c(start:stop, collapse = ","),
                                 n_rows  > 1 & row_number() == 1 ~ str_c(c(start:stop, unstop:lead(stop)), collapse = ","),
                                 TRUE ~ "fail"))

Error in `mutate()`:
! Problem while computing `split_weeks = case_when(...)`.
ℹ The error occurred in group 1: group = 1.
Caused by error in `unstop:lead(stop)`:
! NA/NaN argument
Run `rlang::last_error()` to see where the error occurred.

Any idea what's going on here?

I assume it has to do with the lead function, because if I remove that part, I "only get warnings, but at least I get a result.

Expected output:

# A tibble: 4 × 6
# Groups:   group [2]
  group start  stop unstop n_rows split_weeks  
  <dbl> <dbl> <dbl>  <dbl>  <int> <chr>        
1     1     2     8     10      1 2,3,4,5,6,7,8
2     2     7     7      7      3 7,8          
3     2     7     8      9      3 fail         
4     2     7     9     10      3 fail         

CodePudding user response:

Here's an alternative that yields the desired output (at least in this case). @Andy Baxter provided a good explanation about why the original fails out; even though case_when uses the result of the first case, the 2nd case throws an error so the operation fails. You could get around this by using lead(stop, default = 0) or perhaps coalesce(lead(stop), SOMETHING), either of which would produce a calculable (if meaningless/unneeded) result when there's no "next" value.

df |>
  group_by(group) |> 
  mutate(n_rows = n()) %>%
  mutate(split_weeks = case_when(
    n_rows == 1 ~ str_c(start:stop, collapse = ","),
    n_rows  > 1 & row_number() == 1 ~ str_c(unstop:(lead(stop, default = 0)), collapse = ","),
    # n_rows  > 1 & row_number() == 1 ~ str_c(unstop:(coalesce(lead(stop), unstop)), collapse = ","), # Alternative
    TRUE ~ "fail"))

Result

# A tibble: 4 × 6
# Groups:   group [2]
  group start  stop unstop n_rows split_weeks  
  <dbl> <dbl> <dbl>  <dbl>  <int> <chr>        
1     1     2     8     10      1 2,3,4,5,6,7,8
2     2     7     7      7      3 7,8          
3     2     7     8      9      3 fail         
4     2     7     9     10      3 fail   

CodePudding user response:

I think the errors are coming up when you ask R to find a lead either a) beyond the end of the table rows or b) outside of the group. You could pass it a default value of 0, which is never used and supresses the error, but only gets halfway there as the function is trying to concatenate every start:stop and unstop:lead(stop) value across all rows per group:

library(tidyverse)

df <- data.frame(group  = c(1, 2, 2, 2),
                 start  = c(2, 7, 7, 7),
                 stop   = c(8, 7, 8, 9),
                 unstop = c(10, 7, 9, 10))


df |>
  group_by(group) |>
  mutate(
    n_rows = n(),
    split_weeks = case_when(
      n_rows == 1 ~ str_c(start:stop, collapse = ","),
      n_rows  > 1 &
        row_number() == 1 ~ str_c(c(start:stop, unstop:lead(stop, default = 0)), collapse = ","),
      TRUE ~ "fail"
    )
  )
#> Warning in start:stop: numerical expression has 3 elements: only the first used

#> Warning in start:stop: numerical expression has 3 elements: only the first used

#> Warning in start:stop: numerical expression has 3 elements: only the first used

#> Warning in start:stop: numerical expression has 3 elements: only the first used
#> Warning in unstop:lead(stop, 0): numerical expression has 3 elements: only the
#> first used

#> Warning in unstop:lead(stop, 0): numerical expression has 3 elements: only the
#> first used
#> # A tibble: 4 × 6
#> # Groups:   group [2]
#>   group start  stop unstop n_rows split_weeks  
#>   <dbl> <dbl> <dbl>  <dbl>  <int> <chr>        
#> 1     1     2     8     10      1 2,3,4,5,6,7,8
#> 2     2     7     7      7      3 7,7          
#> 3     2     7     8      9      3 fail         
#> 4     2     7     9     10      3 fail

A way of tidying up would be:

  • finding lead values outside of groups (with default to avoid error in last row)
  • find number of rows and row numbers within groups
  • ungroup again!
  • do rowwise calculations so that R only focuses on values in that row
  • perform concatenation

Which ends up as so, though not sure why 7,7,8 is being put in that cell (makes sense as it's concatenating 7 to 7 and 7 to 8):

df |> 
  mutate(lead_stop = lead(stop, default = 0)) |>
  group_by(group) |>
  mutate(
    n_rows = n(),
    rownum = row_number()) |>
  ungroup() |>
  rowwise() |>
  mutate(
    split_weeks = case_when(
      rownum > 1 ~ "fail",
      n_rows == 1 ~ str_c(start:stop, collapse = ","),
      n_rows  > 1 & rownum == 1 ~ str_c(c(start:stop, unstop:lead_stop), collapse = ","),
      TRUE ~ "fail"
    )
  )
#> # A tibble: 4 × 8
#> # Rowwise: 
#>   group start  stop unstop lead_stop n_rows rownum split_weeks  
#>   <dbl> <dbl> <dbl>  <dbl>     <dbl>  <int>  <int> <chr>        
#> 1     1     2     8     10         7      1      1 2,3,4,5,6,7,8
#> 2     2     7     7      7         8      3      1 7,7,8        
#> 3     2     7     8      9         9      3      2 fail         
#> 4     2     7     9     10         0      3      3 fail

Created on 2022-05-07 by the reprex package (v2.0.1)

  • Related