I am trying to apply a custom function over a group of IDs only when some conditions are met for each column of that group (data should be only numeric and their sum non-zero). Here is the reproducible example:
dat <- as.Date("2021/08/04")
len <- 5
seq(dat, by = "day", length.out = len)
input <- data.frame(
date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
id = c("aa", "aa","aa","aa","aa","bb","bb","bb","bb","bb"),
var1 = c(2,3,4,6,7,8,9,3,5,6),
var2 = c(0, 0, 0, 0, 0, 1, 2, 3 ,4, 5),
var3 = c("hi", "hi", "hi", "hi", "hi", 1, 2, 3 ,4, 5)
)
Here is my custom Rescale function:
rescale = function(x,max_range=100){
return(((x-min(x))/(max(x)-min(x)))*max_range)
}
And this is the desired output:
output <- data.frame(
date = c(seq(dat, by = "day", length.out = len) , seq(dat, by = "day", length.out = len)),
id = c("aa", "aa","aa","aa","aa","bb","bb","bb","bb","bb"),
var1 = c(0, 20, 40, 80, 100, 83.3, 100, 0, 33.3, 50),
var2 = c(0, 0, 0, 0, 0, 0, 25, 50 ,75, 100),
var3 = c("hi", "hi", "hi", "hi", "hi", 0, 25, 50 ,75, 100)
)
I am using the following lines to solve this, using dplyr:
out = input %>%
dplyr::group_by(id) %>%
dplyr::mutate_if(~is.numeric(.) && sum(.x) != 0 ,rescale) %>%
dplyr::arrange(date, .by_group = TRUE) %>%
dplyr::ungroup()
The problem with these lines is that conditions do not refer to the columns of each group exclusively, but to the whole column of the input
table. Hence, the function is applied to the whole table-column although the conditions are met only for one ID. In this example function was applied for aa~var2
(which is not desired) and wasn't applied for bb~var3
(which is desired).
Could you please help me correct the code of out
? Thank you.
CodePudding user response:
This gives you the desired output, but it is much harder than it needs to be because you want to preserve the "hi" values. A numeric column cannot have text in it, so you have to convert to numeric, handle NA values, rescale the non-NA values, then convert to character and rewrite the "hi" values in place. Furthermore, you need to go back at the end and re-convert the columns without "hi" in them back to numeric. You could avoid all this by having NA values instead of "hi", but anyway, if you really have to preserve the "hi" values, you can do:
library(dplyr)
input %>%
group_by(id) %>%
mutate(across(contains("var"), function(x) {
x_n <- suppressWarnings(as.numeric(x))
if(all(is.na(x_n))) return(x)
if(all(x_n == 0)) return(as.character(x_n))
x_n[!is.na(x_n)] <- rescale(x_n[!is.na(x_n)])
if(any(is.na(x_n))) {
x_n <- as.character(x_n)
x_n[is.na(x_n)] <- x[is.na(x_n)]
}
as.character(x_n)
})) %>%
ungroup() %>%
mutate(across(contains("var"), function(x) {
if(any(is.na(suppressWarnings(as.numeric(x))))) x else as.numeric(x)
}))
#> # A tibble: 10 x 5
#> date id var1 var2 var3
#> <date> <chr> <dbl> <dbl> <chr>
#> 1 2021-08-04 aa 0 0 hi
#> 2 2021-08-05 aa 20 0 hi
#> 3 2021-08-06 aa 40 0 hi
#> 4 2021-08-07 aa 80 0 hi
#> 5 2021-08-08 aa 100 0 hi
#> 6 2021-08-04 bb 83.3 0 0
#> 7 2021-08-05 bb 100 25 25
#> 8 2021-08-06 bb 0 50 50
#> 9 2021-08-07 bb 33.3 75 75
#> 10 2021-08-08 bb 50 100 100
If you are prepared to have NA values instead of "hi" (and thereby have numeric columns that you can actually perform calculations on), you can simplify to
input %>%
group_by(id) %>%
mutate(across(contains("var"), function(x) {
x_n <- suppressWarnings(as.numeric(x))
if(all(is.na(x_n))) return(NA)
if(all(x_n == 0)) return(x_n)
x_n[!is.na(x_n)] <- rescale(x_n[!is.na(x_n)])
x_n
}))
#> # A tibble: 10 x 5
#> # Groups: id [2]
#> date id var1 var2 var3
#> <date> <chr> <dbl> <dbl> <dbl>
#> 1 2021-08-04 aa 0 0 NA
#> 2 2021-08-05 aa 20 0 NA
#> 3 2021-08-06 aa 40 0 NA
#> 4 2021-08-07 aa 80 0 NA
#> 5 2021-08-08 aa 100 0 NA
#> 6 2021-08-04 bb 83.3 0 0
#> 7 2021-08-05 bb 100 25 25
#> 8 2021-08-06 bb 0 50 50
#> 9 2021-08-07 bb 33.3 75 75
#> 10 2021-08-08 bb 50 100 100
Edit
Removing the complication of having a non-numeric column altogether by having var3 = c(1, 2, 3 ,4, 5,1, 2, 3 ,4, 5)
as suggested in the comments by the OP makes this far easier:
input %>%
group_by(id) %>%
mutate(across(contains("var"), ~ if(all(.x == 0)) .x else rescale(.x)))
#> # A tibble: 10 x 5
#> # Groups: id [2]
#> date id var1 var2 var3
#> <date> <chr> <dbl> <dbl> <dbl>
#> 1 2021-08-04 aa 0 0 0
#> 2 2021-08-05 aa 20 0 25
#> 3 2021-08-06 aa 40 0 50
#> 4 2021-08-07 aa 80 0 75
#> 5 2021-08-08 aa 100 0 100
#> 6 2021-08-04 bb 83.3 0 0
#> 7 2021-08-05 bb 100 25 25
#> 8 2021-08-06 bb 0 50 50
#> 9 2021-08-07 bb 33.3 75 75
#> 10 2021-08-08 bb 50 100 100
CodePudding user response:
rescale = function(x,max_range=100){
if(min(x) == max(x)) return(x)
return(((x-min(x))/(max(x)-min(x)))*max_range)
}
input %>%
group_by(id)%>%
mutate(across(where(is.numeric), rescale))
# A tibble: 10 × 5
# Groups: id [2]
date id var1 var2 var3
<date> <chr> <dbl> <dbl> <chr>
1 2021-08-04 aa 0 0 hi
2 2021-08-05 aa 20 0 hi
3 2021-08-06 aa 40 0 hi
4 2021-08-07 aa 80 0 hi
5 2021-08-08 aa 100 0 hi
6 2021-08-04 bb 83.3 0 1
7 2021-08-05 bb 100 25 2
8 2021-08-06 bb 0 50 3
9 2021-08-07 bb 33.3 75 4
10 2021-08-08 bb 50 100 5