I want to replace every number range in a string with its mean. For example:
DT <- data.table(raw.data = c("1-2 min", "3-5 min or 2-4 min"),
desirable.data = c("1.5 min", "4 min or 3 min"))
Currently, I can do this with gsubfn:
gsubfn(pattern = "(\\d+)(-)(\\d+)", ~ mean(c(as.numeric(..1), as.numeric(..3))), x = DT$raw.data)
Still, I'm wondering whether there is a better solution, because this approach returns the error "Variable '1' is not found in calling scope." when used inside data.table. Thanks!
More generally, when I replace part of a string using a regex, I want to compute the replacement from the captured groups rather than supply a fixed string. But functions like str_replace_all don't seem to work this way: the "replacement" argument appears to accept only a string.
Update
DT[, new := gsubfn(pattern = "(\\d+)(-)(\\d+)",
~ mean(c(as.numeric(..1), as.numeric(..3))),
x = raw.data)]
When trying to use gsubfn in data.table, there's an error:
Error in `[.data.table`(DT, , `:=`(new, gsubfn(pattern = "(\\d+)(-)(\\d+)", :
Variable '1' is not found in calling scope. Looking in calling scope because this symbol was prefixed with .. in the j= parameter.
CodePudding user response:
You can build string representations of R expressions and evaluate them using str2expression and eval.
# replace ranges with calls to mean() and some quotes for string building
expr.str <- gsub("(\\d+)-(\\d+)", "', mean(c(\\1, \\2)), '", DT$raw.data)
# remove the extra "', " at the beginning
expr.str <- sub("^', ", "", expr.str)
# wrap in paste0() while adding the final single quote
expr.str <- paste0("paste0(", expr.str, "')")
# [1] "paste0(mean(c(1, 2)), ' min')"
# [2] "paste0(mean(c(3, 5)), ' min or ', mean(c(2, 4)), ' min')"
# create expressions and evaluate them
DT$data.mean <- sapply(str2expression(expr.str), eval)
# raw.data data.mean
# 1: 1-2 min 1.5 min
# 2: 3-5 min or 2-4 min 4 min or 3 min
CodePudding user response:
The code in the question does not give any error so we assume that the code being referred to is the code in the Note at the end, shown in reproducible form.
The incompatibility with data.table seems to lie only in the ..1 and ..3 symbols, which data.table interprets as variables to look up in the calling scope. Try either of these one-line solutions instead. In the first, the dot dot dot refers to all the arguments (one per capture group). In the second, x and y (or any variables not defined in the body) refer to the arguments in the order they are encountered in the body. We have removed the parentheses around the minus sign since that capture group is not used in the function.
DT[, d := gsubfn("(\\d+)-(\\d+)", ~ mean(as.numeric(c(...))), raw.data)]
DT[, d := gsubfn("(\\d+)-(\\d+)", ~ mean(as.numeric(c(x, y))), raw.data)]
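For comparison, here is a minimal base R sketch of the same group-based replacement that avoids gsubfn's formula notation altogether. It is not part of the original answer: the helper name `range_to_mean` is hypothetical, and the sketch assumes every range has the form "digits-digits" with non-negative integers.

```r
# Hypothetical helper (not from gsubfn or data.table): replace each
# "a-b" range in a character vector with the mean of a and b.
range_to_mean <- function(x) {
  # locate every "digits-digits" match in each string
  m <- gregexpr("\\d+-\\d+", x)
  # compute one replacement per match, then write them back in place
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(strsplit(hits, "-", fixed = TRUE),
           function(p) as.character(mean(as.numeric(p))),
           character(1))
  })
  x
}

range_to_mean(c("1-2 min", "3-5 min or 2-4 min"))
# [1] "1.5 min"        "4 min or 3 min"
```

Because the helper is an ordinary vectorised function with no `..`-prefixed symbols, it should also work inside data.table, for example as `DT[, d := range_to_mean(raw.data)]`.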
Note
Assumed code in reproducible form.
library(data.table)
library(gsubfn)
DT <- data.table(raw.data = c("1-2 min", "3-5 min or 2-4 min"),
desirable.data = c("1.5 min", "4 min or 3 min"))
DT[, d := gsubfn(pattern = "(\\d+)(-)(\\d+)", ~ mean(c(as.numeric(..1), as.numeric(..3))), x = DT$raw.data)]
CodePudding user response:
What seems like a trivial task is in fact quite complex (at least the way I see it; perhaps others can simplify this approach or come up with a simpler one in the first place). The difficulty comes from the rows containing "or" (if they weren't there, things would be pretty easy). I therefore propose a two-step procedure: first deal with the rows containing "or", then bind the results back onto the remaining rows of the original data and deal with the rest:
library(dplyr)
library(tidyr)
library(stringr)

DT %>%
### STEP 1:
# separate out the row(s) with "or":
filter(str_detect(raw.data, "or")) %>%
# give each range a separate row:
separate_rows(raw.data, sep = " or ") %>%
mutate(
# extract digits, convert to numeric, and calculate mean:
Mean = lapply(lapply(str_extract_all(raw.data, "\\d+"), as.numeric), mean)) %>%
# unnest list:
unnest(Mean) %>%
# paste separated elements back together:
summarise(
raw.data = str_c(raw.data, collapse = " "),
Mean = str_c(Mean, collapse = " ")) %>%
mutate(
# add "min":
Mean = gsub("(\\d+)\\s(\\d+)","\\1 min or \\2 min", Mean)) %>%
### STEP 2:
# bind result back together with DT except for rows with "or" value:
bind_rows(DT %>% filter(!str_detect(raw.data, "or")), .) %>%
mutate(
# if `mean` is NA, repeat procedure for mean calculation from above:
Mean = ifelse(is.na(Mean),
lapply(lapply(str_extract_all(raw.data, "\\d+"), as.numeric), mean),
Mean),
# add "min"
Mean = sub("(\\d+(\\.\\d+)?)$", "\\1 min", Mean))
raw.data Mean
1: 1-2 min 1.5 min
2: 3-5 min 2-4 min 4 min or 3 min