Home > Software engineering >  dplyr mutate based on columns condition and an external vector
dplyr mutate based on columns condition and an external vector

Time:09-27

I am trying to add a list column to a tibble data frame. The resulting list column is calculated from two columns contained in the data frame and a vector which is external / independent.

Suppose that the data frame and the vector are the following:

library(dplyr)
library(magrittr)
dat <- tibble(A = c(12, 27, 22, 1, 15, 30, 20, 28, 19),
              B = c(68, 46, 69, 7, 44, 76, 72, 50, 51))

vec <- c(12, 25, 28, 58, 98)

Now, I would like to add (mutate) the column y so that for each row y is a list containing the elements of vec between A and B (inclusive).

The not-so-proper way to do this would be via loop. I initialize the column y as list and update it row-wise based on the condition A <= vec & vec <= B:

dat %<>% 
  mutate(y = list(vec))

for (i in 1:nrow(dat)){
  dat[i,]$y[[1]] <- (vec[dat[i,]$A <= vec &  vec <= dat[i,]$B])
}

The result is a data frame with y being a list of dbl of variable length:

> dat
# A tibble: 9 x 3
      A     B y        
  <dbl> <dbl> <list>   
1    12    68 <dbl [4]>
2    27    46 <dbl [1]>
3    22    69 <dbl [3]>
4     1     7 <dbl [0]>
5    15    44 <dbl [2]>
6    30    76 <dbl [1]>
7    20    72 <dbl [3]>
8    28    50 <dbl [1]>
9    19    51 <dbl [2]>

The first four values of y are:

[[1]]
[1] 12 25 28 58

[[2]]
[1] 28

[[3]]
[1] 25 28 58

[[4]]
numeric(0)

Note: the 4-th list is empty, because no value of vec is between A=1 and B=7.

I have tried as an intermediate step with getting the subscripts via which using mutate(y = list(which(A <= vec & vec <= B))) or with a combination of seq and %in%, for instance mutate(y = list(vec %in% seq(A, B))). These both give an error. However, I don't need the subscripts, I need a subset of vec.

CodePudding user response:

Create a small helper function with the logic that you want to implement.

return_values_in_between <- function(vec, A, B) {
  vec[A <= vec & vec <= B]
}

and call the function for each row (using rowwise) -

library(dplyr)

result <- dat %>%
  rowwise() %>%
  mutate(y = list(return_values_in_between(vec, A, B))) %>%
  ungroup()

result

# A tibble: 9 × 3
#      A     B    y        
#   <dbl> <dbl> <list>   
#1    12    68 <dbl [4]>
#2    27    46 <dbl [1]>
#3    22    69 <dbl [3]>
#4     1     7 <dbl [0]>
#5    15    44 <dbl [2]>
#6    30    76 <dbl [1]>
#7    20    72 <dbl [3]>
#8    28    50 <dbl [1]>
#9    19    51 <dbl [2]>

Checking the first 4 values in result$y -

result$y
#[[1]]
#[1] 12 25 28 58

#[[2]]
#[1] 28

#[[3]]
#[1] 25 28 58

#[[4]]
#numeric(0)

#...
#...

CodePudding user response:

With the help of @Ronak Shah, I was able to come up with a solution that doesn't require a dedicated function and also makes sure that the vec is pulled from the global environment (in case there might be a column vec in the data frame):

library(tidyverse)
dat |> 
  rowwise() |> 
  mutate(y = list(.GlobalEnv$vec[.GlobalEnv$vec >= A & .GlobalEnv$vec <= B])) |> 
  ungroup()
  • Related