Conditional creation (mutate) of new columns

Time:10-26

I have a vector containing "potential" column names:

col_vector <- c("A", "B", "C")

I also have a data frame, e.g.

library(tidyverse)
df <- tibble(A = 1:2,
             B = 1:2)

My goal now is to create all columns mentioned in col_vector that don't yet exist in df.

For the above example, the code below works:

df %>%
  mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)

# A tibble: 2 x 3
      A     B C    
  <int> <int> <lgl>
1     1     1 NA   
2     2     2 NA  

The problem is that this code fails as soon as (a) more than one column from col_vector is missing or (b) no column from col_vector is missing. I thought about some sort of if_else, but I don't know how to make the column creation conditional in that way, preferably in a tidyverse style. I know I could loop over all the missing columns, but I'm wondering whether there is a more direct approach.

Example data where code above fails:

df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
              B = 1:2,
              C = 1:2)
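A likely reason for both failures, sketched under the assumption that sym() here is rlang::sym() (re-exported through the tidyverse): sym() accepts exactly one string, while setdiff() returns two names for df2 and none for df3. The helper try_sym() below is hypothetical and only reports whether the conversion succeeds.

```r
library(rlang)

col_vector <- c("A", "B", "C")

# Hypothetical helper: TRUE if sym() can convert `x`, FALSE if it errors.
try_sym <- function(x) {
  tryCatch({
    sym(x)
    TRUE
  }, error = function(e) FALSE)
}

missing_df  <- setdiff(col_vector, c("A", "B"))       # one name: "C"
missing_df2 <- setdiff(col_vector, "A")               # two names: "B", "C"
missing_df3 <- setdiff(col_vector, c("A", "B", "C"))  # character(0)

try_sym(missing_df)   # single string, conversion succeeds
try_sym(missing_df2)  # length-2 vector, sym() errors
try_sym(missing_df3)  # empty vector, sym() errors
```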

CodePudding user response:

This should work.

df[, setdiff(col_vector, colnames(df))] <- NA

CodePudding user response:

Solution

This base solution might be simpler than a dplyr workflow:

library(tidyverse)


# ...
# Code to generate 'df'.
# ...


# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA


# View results
df

Results

Given your sample col_vector and df here

col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)

this solution should yield the following results:

# A tibble: 2 x 3
      A     B C    
  <int> <int> <lgl>
1     1     1 NA   
2     2     2 NA   
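As a sketch, the same assignment can be wrapped in a small helper and tried on df2 and df3 from the question. The name add_missing_cols() is hypothetical, and the length check is a defensive assumption so that the no-missing-column case skips the assignment entirely.

```r
library(tibble)

col_vector <- c("A", "B", "C")

# Hypothetical helper: add every name in `cols` that `data` lacks, filled with NA.
add_missing_cols <- function(data, cols) {
  to_add <- setdiff(cols, names(data))
  # Assumption: skip the assignment when nothing is missing.
  if (length(to_add) > 0) {
    data[, to_add] <- NA
  }
  data
}

df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2, B = 1:2, C = 1:2)

out2 <- add_missing_cols(df2, col_vector)  # B and C are missing
out3 <- add_missing_cols(df3, col_vector)  # nothing is missing

names(out2)
names(out3)
```

Because the set of missing names is computed inside the helper, the same call covers one, several, or zero missing columns.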

Advantages

An advantage of my solution over the alternative linked above by @geoff is that you need not hand-code every column name, both as a symbol and as a string, within the dplyr workflow:

df %>% mutate(
  #####################################
  # Scalar test, so use if/else: ifelse() would keep only the first element.
  A = if ("A" %in% names(.)) A else NA,
  B = if ("B" %in% names(.)) B else NA,
  C = if ("C" %in% names(.)) C else NA

  # ...
  # etc.
  #####################################
)

My solution is by contrast more dynamic

     ##############################
df[, setdiff(col_vector, names(df))] <- NA
     ##############################

if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
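If a tidyverse-style call is still preferred, as the question asks, one possible sketch (not taken from the answers above) uses tibble::add_column() with !!! splicing. The named vector defaults is a hypothetical helper object: it holds one NA per candidate column, and only the entries whose names are absent from the data frame get spliced in.

```r
library(tibble)

col_vector <- c("A", "B", "C")
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2, B = 1:2, C = 1:2)

# Hypothetical helper object: one NA per candidate column, named after that column.
defaults <- setNames(rep(NA, length(col_vector)), col_vector)

# Keep only the defaults whose names are absent, then splice them in as arguments.
out2 <- add_column(df2, !!!defaults[setdiff(names(defaults), names(df2))])

# With nothing missing, the spliced vector is empty and no column is added.
out3 <- add_column(df3, !!!defaults[setdiff(names(defaults), names(df3))])

names(out2)
names(out3)
```

As with the base assignment, the set difference is evaluated at runtime, so changing col_vector requires no other edits.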

Note

Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.
