Add two columns simultaneously via mutate

Time:02-16

I would like to use dplyr::mutate to add two named columns to a data frame simultaneously, with a single function call. Consider the following example:

library(dplyr)

n <- 1e2; M <- 1e3
variance <- 1

x <- rnorm(n*M, 0, variance)
s <- rep(1:M, each = n)

dat <- data.frame(s = s, x = x)

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}

ci_studclt(x)

Trying something like the below returns an error, since obviously two values are produced and cannot be inserted into a single atomic-type column.

dat %>% 
  group_by(s) %>% 
  mutate(ci = ci_studclt(x, variance))

It seems one option is to insert a list column and then call unnest_wider, and that this is easier with data.table or in the specific case of splitting a string column into two new columns.

In my example, a confidence interval (lower and upper bound) comes out of a function, and I would like to add both values directly as new columns to dat, e.g. calling the columns ci_lower and ci_upper.

Is there a straightforward way of doing this with dplyr or do I need to insert the elements as a list column then unnest?

NB Keep in mind that the confidence interval values are a function of a group of simulated values x, grouped by s; the CI values should be constant within a group.

CodePudding user response:

You can do this by having your function (or a wrapper function) return a data.frame. When you call it in mutate, don’t specify a column name (or else you’ll end up with a nested data.frame column). If you want to specify names for the new columns, you can include them as function arguments as in the below.


library(dplyr)

n <- 1e2; M <- 1e3
variance <- 1

x <- rnorm(n*M, 0, variance)
s <- rep(1:M, each = n)

dat <- data.frame(s = s, x = x)

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}

ci_wrapper <- function(x, alpha = 0.05, names_out = c("ci_lower", "ci_upper")) {
  ci <- ci_studclt(x, alpha = alpha)
  out <- data.frame(ci[[1]], ci[[2]])
  names(out) <- names_out
  out
}

# original code was ci_studclt(x, variance)
# but ci_studclt() doesn't take a variance argument, so I omitted
dat %>% 
  group_by(s) %>% 
  mutate(ci_wrapper(x))

output:

# A tibble: 100,000 x 4
# Groups:   s [1,000]
       s       x ci_lower ci_upper
   <int>   <dbl>    <dbl>    <dbl>
 1     1  0.233    -0.223    0.139
 2     1  1.03     -0.223    0.139
 3     1  1.53     -0.223    0.139
 4     1  0.0150   -0.223    0.139
 5     1 -0.211    -0.223    0.139
 6     1 -1.13     -0.223    0.139
 7     1 -1.51     -0.223    0.139
 8     1  0.371    -0.223    0.139
 9     1  1.80     -0.223    0.139
10     1 -0.137    -0.223    0.139
# ... with 99,990 more rows

With specified column names:

dat %>% 
  group_by(s) %>% 
  mutate(ci_wrapper(x, names_out = c("ci.lo", "ci.hi")))

output:

# A tibble: 100,000 x 4
# Groups:   s [1,000]
       s       x  ci.lo ci.hi
   <int>   <dbl>  <dbl> <dbl>
 1     1  0.233  -0.223 0.139
 2     1  1.03   -0.223 0.139
 3     1  1.53   -0.223 0.139
 4     1  0.0150 -0.223 0.139
 5     1 -0.211  -0.223 0.139
 6     1 -1.13   -0.223 0.139
 7     1 -1.51   -0.223 0.139
 8     1  0.371  -0.223 0.139
 9     1  1.80   -0.223 0.139
10     1 -0.137  -0.223 0.139
# ... with 99,990 more rows
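
The point about leaving the column unnamed can be checked on a small example. The sketch below uses a hypothetical toy data frame (not the simulated dat) and compares the unnamed call with a named one; the names toy, unpacked and nested are made up for illustration.

```r
library(dplyr)

set.seed(1)
# Hypothetical toy data: 2 groups of 5 observations each
toy <- data.frame(s = rep(1:2, each = 5), x = rnorm(10))

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  mean(x) + qt(c(alpha / 2, 1 - alpha / 2), df = n - 1) * sqrt(var(x) / n)
}

ci_wrapper <- function(x, alpha = 0.05, names_out = c("ci_lower", "ci_upper")) {
  ci <- ci_studclt(x, alpha = alpha)
  out <- data.frame(ci[[1]], ci[[2]])
  names(out) <- names_out
  out
}

# Unnamed: the one-row data frame is unpacked into two ordinary columns
unpacked <- toy %>% group_by(s) %>% mutate(ci_wrapper(x))

# Named: the whole data frame is stored as a single column called ci
nested <- toy %>% group_by(s) %>% mutate(ci = ci_wrapper(x))

names(unpacked)           # s, x, ci_lower, ci_upper
names(nested)             # s, x, ci
is.data.frame(nested$ci)  # TRUE
```

In the named case the bounds are still reachable as nested$ci$ci_lower, but most downstream code expects the flat layout from the unnamed call.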

CodePudding user response:

If you get your function to return a two-column data frame with repeated values of the same length as the input, then this becomes very easy:

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  res <- mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
  data.frame(lower = rep(res[1], length(x)), upper = res[2])
}

dat %>% 
  group_by(s) %>% 
  mutate(ci_studclt(x))

#> # A tibble: 100,000 x 4
#> # Groups:   s [1,000]
#>        s      x  lower upper
#>    <int>  <dbl>  <dbl> <dbl>
#>  1     1 -0.767 -0.147 0.293
#>  2     1 -0.480 -0.147 0.293
#>  3     1 -1.31  -0.147 0.293
#>  4     1  0.219 -0.147 0.293
#>  5     1  0.650 -0.147 0.293
#>  6     1  0.542 -0.147 0.293
#>  7     1 -0.249 -0.147 0.293
#>  8     1  2.22  -0.147 0.293
#>  9     1 -0.239 -0.147 0.293
#> 10     1  0.176 -0.147 0.293
#> # ... with 99,990 more rows

CodePudding user response:

Another possible variation, if you don't want to change your ci_studclt function:

dat %>% 
  group_by(s) %>% 
  mutate(
    across(x,
           .fns = list(
             lower = ~ci_studclt(.)[1],
             upper = ~ci_studclt(.)[2]
             )
           )
    )

In this case the output will contain new x_lower and x_upper columns. This variant also scales reasonably well: if you want to apply the function to another column y too, you can replace x with c(x, y) and get y_lower and y_upper columns in dat as well. Note that ci_studclt is evaluated twice per column and group, once for each bound.
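
As a rough illustration of the c(x, y) case, here is a sketch on a hypothetical toy data frame with a second column y (the original dat has no y column, so toy and y are assumptions made only for this example):

```r
library(dplyr)

set.seed(1)
# Hypothetical toy data with two numeric columns
toy <- data.frame(s = rep(1:2, each = 5), x = rnorm(10), y = rnorm(10))

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  mean(x) + qt(c(alpha / 2, 1 - alpha / 2), df = n - 1) * sqrt(var(x) / n)
}

res <- toy %>%
  group_by(s) %>%
  mutate(
    across(c(x, y),
           .fns = list(
             lower = ~ci_studclt(.)[1],
             upper = ~ci_studclt(.)[2]
           ))
  ) %>%
  ungroup()

# New columns are named <column>_<function name>:
# x_lower, x_upper, y_lower, y_upper
names(res)
```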

UPDATE

Actually, everything Allan did in his answer can be done inside the mutate call, without modifying the original function:

dat %>% 
  group_by(s) %>% 
  mutate(
    t(ci_studclt(x)) %>% 
      as.data.frame() %>% 
      setNames(c('ci_lower','ci_upper'))  # base R; set_names() would need purrr or magrittr attached
    )

We simply transpose the output of ci_studclt(x) so that as.data.frame treats it as a single row, then give this one-row data frame the correct column names.
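
To see why the transpose is needed, it may help to look at the intermediate objects outside of mutate. The sketch below uses made-up bounds rather than real output from ci_studclt, and only base R:

```r
# Hypothetical lower and upper bounds, standing in for ci_studclt(x)
ci <- c(-0.2, 0.1)

# t() turns the length-2 vector into a 1 x 2 matrix
m <- t(ci)
dim(m)  # 1 2

# as.data.frame() on that matrix gives one row with default names V1, V2
default_df <- as.data.frame(m)
names(default_df)  # "V1" "V2"

# Renaming gives the one-row data frame that mutate() then recycles per group
row_df <- setNames(default_df, c("ci_lower", "ci_upper"))

# Without t(), the vector would become one column with two rows instead
col_df <- as.data.frame(ci)
dim(col_df)  # 2 1
```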
