case_when fails (unexpected results) but conditions for case_when work outside

Time:04-21

I'm trying to apply some equations to get the proportion of a line (a numeric array) that coincides with another line (another numeric array). I have a data frame with the required values, and I want to create a new column with the percentage based on how the two lines coincide. I have checked the conditions in the code below by hand with some examples and they work, but when I apply case_when() to the data frame, the output is not what it should be. Here is a basic example.

This is my data. It has an 'id' column [char], a 'day' (date) column [dttm], a 'result' (value) column [double], a 'difs' column with the number of days since the previous row [int], and a 'Grp' column, which is a subgrouping value.


Here is the code I'm using. The idea is to take the previous value and calculate what percentage of the interval between it and the current value lies inside a target interval whose limits are [2, 3]. For now I'm only checking whether the conditions are evaluated correctly for every row. However, some rows get 'A' when they should get '0', others get 'Inf' when they should get 'A', and so on. I thought mutate() evaluates each row independently within the group, so I don't understand why the outcomes differ from what I get when checking by hand.

Rsup = 3 # High limit of target array
Rinf = 2 # Low limit of target array

example_output = example%>% 
  arrange(id,Grp,day) %>% 
  group_by(id,Grp) %>% # Group by episodes (id + Grp)
  mutate(from_r = lag(result)) %>% # get previous result y(t-1)
  filter(difs != 0, difs < 181) %>% # discard first sample of every subgroup/episode
  mutate( 
    p_days = case_when(
    (min(result,from_r) < Rinf) & (max(result,from_r) > Rsup) ~ 'A',  
    (min(result,from_r) > Rinf) & (max(result,from_r) < Rsup) ~ '100',
    (min(result,from_r) < Rinf) & (max(result,from_r) > Rinf) ~ 'Inf',
    (min(result,from_r) < Rsup) & (max(result,from_r) > Rsup) ~ 'Sup',
    TRUE ~ '0')
  )
# Case 'A':  check interval yt - yt-1 cuts target array for both limits
# Case '100': all the interval yt - yt-1 is inside target array (100%)
# Case 'Inf': interval cuts low limit of target array
# Case 'Sup': interval cuts high limit of target array
# Case True ~ '0': interval does not cut target array and it is not inside (0%)

This is how to create the basic example:

structure(list(id = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B"), day = structure(c(19104, 19105, 
19106, 19107, 19108, 19109, 19110, 19111, 19112, 19113, 19304, 
19305, 19306, 19307, 19604, 19605, 19606, 19607, 19608, 19609, 
19204, 19205, 19206, 19207, 19208, 19209, 19210, 19211, 19212, 
19213, 19214, 19215, 19216, 19217, 19218, 19219, 19220, 19221, 
19222, 19223), class = "Date"), result = c(1.55, 1.92, 3.6, 3.45, 
3.3, 3.46, 2.79, 2.55, 2.08, 2.27, 2.44, 4.59, 1.8, 0.75, 3.13, 
2.59, 2.16, 2.93, 1.38, 2.92, 3.19, 3.23, 3.48, 3.39, 2.62, 2.66, 
3.77, 3.44, 3.06, 2.59, 2.87, 1.97, 2.5, 2.84, 1.48, 3.04, 2.62, 
0.76, 2.74, 2.84), difs = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 191, 
1, 1, 1, 297, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1), Grp = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -40L), groups = structure(list(
    id = c("A", "B"), .rows = structure(list(1:20, 21:40), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Of course, if someone knows a function that produces the output I'm trying to get with mutate and case_when, that would be very helpful too. Thanks in advance.

EDIT: I thought mutate() evaluates each row independently within the group, so I don't understand why the outcomes are wrong. Maybe it somehow mixes the result (and from_r) values of every group?

CodePudding user response:

The issue is that you use min/max instead of the vectorized pmin/pmax:

library(dplyr)

ex1 <- example %>% 
  arrange(id,Grp,day) %>% 
  group_by(id,Grp) %>% # Group by episodes (id + Grp)
  mutate(from_r = lag(result)) %>% # get previous result y(t-1)
  filter(difs != 0, difs < 181) # discard first sample of every subgroup/episode

ex1 %>%
  mutate( 
    p_days = case_when(
      (pmin(result,from_r) < Rinf) & (pmax(result,from_r) > Rsup) ~ 'A',  
      (pmin(result,from_r) > Rinf) & (pmax(result,from_r) < Rsup) ~ '100',
      (pmin(result,from_r) < Rinf) & (pmax(result,from_r) > Rinf) ~ 'Inf',
      (pmin(result,from_r) < Rsup) & (pmax(result,from_r) > Rsup) ~ 'Sup',
      TRUE ~ '0')
  )
#> # A tibble: 36 × 7
#> # Groups:   id, Grp [4]
#>    id    day        result  difs   Grp from_r p_days
#>    <chr> <date>      <dbl> <dbl> <dbl>  <dbl> <chr> 
#>  1 A     2022-04-23   1.92     1     1   1.55 0     
#>  2 A     2022-04-24   3.6      1     1   1.92 A     
#>  3 A     2022-04-25   3.45     1     1   3.6  0     
#>  4 A     2022-04-26   3.3      1     1   3.45 0     
#>  5 A     2022-04-27   3.46     1     1   3.3  0     
#>  6 A     2022-04-28   2.79     1     1   3.46 Sup   
#>  7 A     2022-04-29   2.55     1     1   2.79 100   
#>  8 A     2022-04-30   2.08     1     1   2.55 100   
#>  9 A     2022-05-01   2.27     1     1   2.08 100   
#> 10 A     2022-11-09   4.59     1     2   2.44 Sup   
#> # … with 26 more rows

To see the difference more clearly, compare:

min(ex1$result,ex1$from_r) < Rinf
#> [1] TRUE
pmin(ex1$result,ex1$from_r) < Rinf
#>  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#> [13] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [25] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE

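As you can see, min(ex1$result, ex1$from_r) collapses both vectors into a single minimum, so the condition min(ex1$result, ex1$from_r) < Rinf yields one TRUE that is recycled for every row of your data (inside a grouped mutate(), it is one value per group). To evaluate the conditions row by row, you have to use pmin/pmax.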
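
The same behaviour can be reproduced on a tiny scale. The following is only a minimal sketch in base R; the three-element vectors are made-up values for illustration, not rows from the example data.

```r
# Hypothetical toy vectors (not taken from the example data)
result <- c(1.5, 2.5, 3.5)
from_r <- c(2.5, 2.8, 2.2)
Rinf   <- 2

# min() pools all values of both vectors and returns ONE number
min(result, from_r)                      # 1.5
scalar_check <- min(result, from_r) < Rinf
scalar_check                             # TRUE (length 1, reused for every row)

# pmin() compares element by element and returns one value PER ROW
pmin(result, from_r)                     # 1.5 2.5 2.2
rowwise_check <- pmin(result, from_r) < Rinf
rowwise_check                            # TRUE FALSE FALSE
```

Because case_when() recycles a length-one condition across all rows, the scalar version assigns the same label to every row it is evaluated over, which matches the unexpected labels in the question.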

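The question also asks for the actual percentage rather than a label. One possible way to compute it with pmin/pmax is sketched below. It assumes that "the proportion of the line inside the target" means the share of the interval between from_r and result that overlaps [Rinf, Rsup], and that a zero-length interval counts as 100 when the point lies inside the target and 0 otherwise. The helper name pct_in_range is hypothetical, not an existing dplyr or base R function.

```r
# Hypothetical helper: percentage of the interval between two consecutive
# values (from_r, result) that falls inside the target range [Rinf, Rsup].
pct_in_range <- function(result, from_r, Rinf = 2, Rsup = 3) {
  lo <- pmin(result, from_r)   # lower end of each interval, row by row
  hi <- pmax(result, from_r)   # upper end of each interval, row by row
  # length of the overlap with the target range, never below zero
  overlap <- pmax(0, pmin(hi, Rsup) - pmax(lo, Rinf))
  len <- hi - lo
  # Assumption: a zero-length interval is 100 if inside the target, else 0
  ifelse(len == 0,
         ifelse(lo >= Rinf & lo <= Rsup, 100, 0),
         100 * overlap / len)
}

# First, second and seventh rows of the filtered example (id A, Grp 1)
out <- pct_in_range(result = c(1.92, 3.6, 2.55),
                    from_r = c(1.55, 1.92, 2.79))
out  # 0, about 59.5, 100
```

Inside the pipeline this could be used as mutate(p_days = pct_in_range(result, from_r, Rinf, Rsup)) in place of the case_when() call. Rows where from_r is NA would yield NA.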