Home > Enterprise >  Adding non existing columns to a data frame using mutate with across in dplyr R
Adding non existing columns to a data frame using mutate with across in dplyr R

Time:08-16

I have a list of column names as follow,

cols <- c('grade', 'score', 'status')

If the data frame doesn't have any of the columns in the cols vector, I want to add that column (values as NA) to the data frame using mutate and across. How to do that?

CodePudding user response:

A base solution:

df[setdiff(cols, names(df))] <- NA

This command can be adapted for pipeline:

df %>%
  `[<-`(, setdiff(cols, names(.)), NA)

#   id score grade status
# 1  1    94    NA     NA
# 2  2    98    NA     NA
# 3  3    93    NA     NA
# 4  4    82    NA     NA
# 5  5    89    NA     NA

Data
set.seed(123)
df <- data.frame(id = 1:5, score = sample(80:100, 5))

CodePudding user response:

A solution using dplyr::mutate()

Suppose that your data frame is diamonds. Then add a tibble data frame that has the same number of columns as the column names (i.e. three columns, in this MWE) to the original data frame (i.e. diamond here).

To create a tibble that contains NA automatically

(Thanks to the comment by Darren Tsai)

To create a tibble that has the same number of columns as the column names, you can first create a matrix that has the same number of columns as the column names by matrix(ncol = length(cols)), and second, transform it to a tibble data frame by as_tibble() and set the column names using .name_repair = ~ cols inside of as_tibble().

The value of each column of the tibble is logicalNA, when the matrix is created. Note but you may prefer one of NA_integer_, NA_real_, NA_complex_, or NA_character_ over NA, if you want mutate these newly added columns later on as integer columns, numeric columns, complex columns (e.g. 1 5i), and character columns, respectively. In such a case, you can mutate the tibble so that you can change the type of column.

You can create such a tibble inside of mutate.

cols <- c('grade', 'score', 'status')

diamonds |>
  mutate(
    matrix(
      ncol = length(cols)
    ) |>
      as_tibble(
        .name_repair = ~ cols
      ) |>
      ## if you want to interpret the grade as `factor` type...
      mutate(
        grade = as.factor(grade)
      )
  )

## # A tibble: 53,940 × 13
##    carat cut       color clarity depth table price     x     y     z grade score
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <lgl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 NA    NA
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 NA    NA   
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 NA    NA
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 NA    NA
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 NA    NA
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 NA    NA
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 NA    NA   
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 NA    NA
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 NA    NA
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 NA    NA
## # … with 53,930 more rows, and 1 more variable: status <lgl>

To create the NA tibble that does not have any column that matches to the original data frame

(Thanks to the comment by Julian)

To ensure that the columns are added to the original data frame only if the original data frame doesn't have any of the columns in the cols vector, you have to select the columns of the NA tibble data frame that are not present in the original data frame. You can do that by using !select(matches(colnames(diamonds))).

cols <- c("grade", "price", "status")

matrix(ncol = length(cols)) |>
  as_tibble(
    .name_repair = ~ cols
  ) |>
  mutate(
    grade = as.factor(grade)
  )

diamonds |>
  mutate(
    matrix(
      ncol = length(cols)
    ) |>
      as_tibble(
        .name_repair = ~cols
      ) |>
      ## if you want to interpret the grade as `factor` type...
      mutate(
        grade = as.factor(grade)
      ) |>
      ## select columns that are not present in the original data frame 
      dplyr::select(
        !matches(colnames(diamonds))
      )
  )

## # A tibble: 53,940 × 12
##    carat cut      color clarity depth table price     x     y     z grade status
##    <dbl> <ord>    <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <lgl> 
##  1  0.23 Ideal    E     SI2      61.5    55   326  3.95  3.98  2.43 NA    NA
##  2  0.21 Premium  E     SI1      59.8    61   326  3.89  3.84  2.31 NA    NA
##  3  0.23 Good     E     VS1      56.9    65   327  4.05  4.07  2.31 NA    NA    
##  4  0.29 Premium  I     VS2      62.4    58   334  4.2   4.23  2.63 NA    NA
##  5  0.31 Good     J     SI2      63.3    58   335  4.34  4.35  2.75 NA    NA
##  6  0.24 Very Go… J     VVS2     62.8    57   336  3.94  3.96  2.48 NA    NA
##  7  0.24 Very Go… I     VVS1     62.3    57   336  3.95  3.98  2.47 NA    NA
##  8  0.26 Very Go… H     SI1      61.9    55   337  4.07  4.11  2.53 NA    NA    
##  9  0.22 Fair     E     VS2      65.1    61   337  3.87  3.78  2.49 NA    NA
## 10  0.23 Very Go… H     VS1      59.4    61   338  4     4.05  2.39 NA    NA
## # … with 53,930 more rows

CodePudding user response:

df <- data.frame(grade = c("A", "B", "C"),
                 score = c(1, 2, 3))

cols <- c('grade', 'score', 'status')

for (i in cols){
    if (!(i %in% colnames(df))){
        df[i] <- NA
    }
}

> df
  grade score status
1     A     1     NA
2     B     2     NA
3     C     3     NA
  • Related