Home > Mobile >  How to apply a function to each group of IDs based on conditions with dplyr
How to apply a function to each group of IDs based on conditions with dplyr

Time:10-21

I am trying to apply a custom function over a group of IDs only when some conditions are met for each column of that group (data should be only numeric and their sum non-zero). Here is the reproducible example:

dat <- as.Date("2021/08/04")
len <- 5
seq(dat, by = "day", length.out = len)

input <- data.frame(
  date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
  id = c("aa", "aa","aa","aa","aa","bb","bb","bb","bb","bb"),
  var1 = c(2,3,4,6,7,8,9,3,5,6),
  var2 = c(0, 0, 0, 0, 0, 1, 2, 3 ,4, 5),
  var3 = c("hi", "hi", "hi", "hi", "hi", 1, 2, 3 ,4, 5) 
)

Here is my custom Rescale function:

rescale = function(x,max_range=100){
  return(((x-min(x))/(max(x)-min(x)))*max_range)
}

And this is the desired output:

output <- data.frame(
  date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
  id = c("aa", "aa","aa","aa","aa","bb","bb","bb","bb","bb"),
  var1 = c(0, 20, 40, 80, 100, 83.3, 100, 0, 33.3, 50),
  var2 = c(0, 0, 0, 0, 0, 0, 25, 50 ,75, 100),
  var3 = c("hi", "hi", "hi", "hi", "hi", 0, 25, 50 ,75, 100)
)

I am using the following lines to solve this, using dplyr:

out = input %>%
      dplyr::group_by(id) %>%
      dplyr::mutate_if(~is.numeric(.) && sum(.x) != 0 ,rescale) %>%
      dplyr::arrange(date, .by_group = TRUE)  %>%
      dplyr::ungroup()

The problem with these lines is that conditions do not refer to the columns of each group exclusively, but to the whole column of the input table. Hence, the function is applied to the whole table-column although the conditions are met only for one ID. In this example function was applied for aa~var2 (which is not desired) and wasn't applied for bb~var3 (which is desired).

Could you please help me correct the code of out ? Thank you.

CodePudding user response:

This gives you the desired output, but it is much harder than it needs to be because you want to preserve the "hi" values. A numeric column cannot have text in it, so you have to convert to numeric, handle NA values, rescale the non-NA values, then convert to character and rewrite the "hi" values in place. Furthermore, you need to go back at the end and re-convert the columns without "hi" in them back to numeric. You could avoid all this by having NA values instead of "hi", but anyway, if you really have to preserve the "hi" values, you can do:

library(dplyr)

input %>%
  group_by(id) %>%
  mutate(across(contains("var"), function(x) {
    x_n <- suppressWarnings(as.numeric(x))
    if(all(is.na(x_n))) return(x)
    if(all(x_n == 0)) return(as.character(x_n))
    x_n[!is.na(x_n)] <- rescale(x_n[!is.na(x_n)])
    if(any(is.na(x_n))) {
      x_n <- as.character(x_n)
      x_n[is.na(x_n)] <- x[is.na(x_n)]
    }
    as.character(x_n)
  })) %>%
  ungroup() %>%
  mutate(across(contains("var"), function(x) {
    if(any(is.na(suppressWarnings(as.numeric(x))))) x else as.numeric(x)
  }))
#> # A tibble: 10 x 5
#>    date       id     var1  var2 var3 
#>    <date>     <chr> <dbl> <dbl> <chr>
#>  1 2021-08-04 aa      0       0 hi   
#>  2 2021-08-05 aa     20       0 hi   
#>  3 2021-08-06 aa     40       0 hi   
#>  4 2021-08-07 aa     80       0 hi   
#>  5 2021-08-08 aa    100       0 hi   
#>  6 2021-08-04 bb     83.3     0 0    
#>  7 2021-08-05 bb    100      25 25   
#>  8 2021-08-06 bb      0      50 50   
#>  9 2021-08-07 bb     33.3    75 75   
#> 10 2021-08-08 bb     50     100 100  

If you are prepared to have NA values instead of "hi" (and thereby have numeric columns that you can actually perform calculations on), you can simplify to

input %>%
  group_by(id) %>%
  mutate(across(contains("var"), function(x) {
    x_n <- suppressWarnings(as.numeric(x))
    if(all(is.na(x_n))) return(NA)
    if(all(x_n == 0)) return(x_n)
    x_n[!is.na(x_n)] <- rescale(x_n[!is.na(x_n)])
    x_n
  })) 
#> # A tibble: 10 x 5
#> # Groups:   id [2]
#>    date       id     var1  var2  var3
#>    <date>     <chr> <dbl> <dbl> <dbl>
#>  1 2021-08-04 aa      0       0    NA
#>  2 2021-08-05 aa     20       0    NA
#>  3 2021-08-06 aa     40       0    NA
#>  4 2021-08-07 aa     80       0    NA
#>  5 2021-08-08 aa    100       0    NA
#>  6 2021-08-04 bb     83.3     0     0
#>  7 2021-08-05 bb    100      25    25
#>  8 2021-08-06 bb      0      50    50
#>  9 2021-08-07 bb     33.3    75    75
#> 10 2021-08-08 bb     50     100   100

Edit

Removing the complication of having a non-numeric column altogether by having var3 = c(1, 2, 3 ,4, 5,1, 2, 3 ,4, 5) as suggested in the comments by the OP makes this far easier:

input %>%
  group_by(id) %>%
  mutate(across(contains("var"), ~ if(all(.x == 0)) .x else rescale(.x)))
#> # A tibble: 10 x 5
#> # Groups:   id [2]
#>    date       id     var1  var2  var3
#>    <date>     <chr> <dbl> <dbl> <dbl>
#>  1 2021-08-04 aa      0       0     0
#>  2 2021-08-05 aa     20       0    25
#>  3 2021-08-06 aa     40       0    50
#>  4 2021-08-07 aa     80       0    75
#>  5 2021-08-08 aa    100       0   100
#>  6 2021-08-04 bb     83.3     0     0
#>  7 2021-08-05 bb    100      25    25
#>  8 2021-08-06 bb      0      50    50
#>  9 2021-08-07 bb     33.3    75    75
#> 10 2021-08-08 bb     50     100   100

CodePudding user response:

  rescale = function(x,max_range=100){
    if(min(x) == max(x)) return(x)
    return(((x-min(x))/(max(x)-min(x)))*max_range)
  }


input %>%
  group_by(id)%>%
  mutate(across(where(is.numeric), rescale))
# A tibble: 10 × 5
# Groups:   id [2]
   date       id     var1  var2 var3 
   <date>     <chr> <dbl> <dbl> <chr>
 1 2021-08-04 aa      0       0 hi   
 2 2021-08-05 aa     20       0 hi   
 3 2021-08-06 aa     40       0 hi   
 4 2021-08-07 aa     80       0 hi   
 5 2021-08-08 aa    100       0 hi   
 6 2021-08-04 bb     83.3     0 1    
 7 2021-08-05 bb    100      25 2    
 8 2021-08-06 bb      0      50 3    
 9 2021-08-07 bb     33.3    75 4    
10 2021-08-08 bb     50     100 5 
  • Related