I have a list of column names as follow,
cols <- c('grade', 'score', 'status')
If the data frame doesn't have any of the columns in the cols vector, I want to add that column (values as NA) to the data frame using mutate and across. How to do that?
CodePudding user response:
A base
solution:
df[setdiff(cols, names(df))] <- NA
This command can be adapted for pipeline:
df %>%
`[<-`(, setdiff(cols, names(.)), NA)
# id score grade status
# 1 1 94 NA NA
# 2 2 98 NA NA
# 3 3 93 NA NA
# 4 4 82 NA NA
# 5 5 89 NA NA
Data
set.seed(123)
df <- data.frame(id = 1:5, score = sample(80:100, 5))
CodePudding user response:
A solution using dplyr::mutate()
Suppose that your data frame is diamonds
. Then add a tibble data frame that has the same number of columns as the column names (i.e. three columns, in this MWE) to the original data frame (i.e. diamond
here).
To create a tibble that contains NA
automatically
(Thanks to the comment by Darren Tsai)
To create a tibble that has the same number of columns as the column names, you can first create a matrix
that has the same number of columns as the column names by matrix(ncol = length(cols))
, and second, transform it to a tibble data frame by as_tibble()
and set the column names using .name_repair = ~ cols
inside of as_tibble()
.
The value of each column of the tibble is logicalNA
, when the matrix is created. Note but you may prefer one of NA_integer_
, NA_real_
, NA_complex_
, or NA_character_
over NA
, if you want mutate these newly added columns later on as integer
columns, numeric columns, complex columns (e.g. 1 5i), and character columns, respectively. In such a case, you can mutate
the tibble so that you can change the type of column.
You can create such a tibble inside of mutate
.
cols <- c('grade', 'score', 'status')
diamonds |>
mutate(
matrix(
ncol = length(cols)
) |>
as_tibble(
.name_repair = ~ cols
) |>
## if you want to interpret the grade as `factor` type...
mutate(
grade = as.factor(grade)
)
)
## # A tibble: 53,940 × 13
## carat cut color clarity depth table price x y z grade score
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <lgl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 NA NA
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 NA NA
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NA NA
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 NA NA
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NA NA
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 NA NA
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 NA NA
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 NA NA
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NA NA
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 NA NA
## # … with 53,930 more rows, and 1 more variable: status <lgl>
To create the NA
tibble that does not have any column that matches to the original data frame
(Thanks to the comment by Julian)
To ensure that the columns are added to the original data frame only if the original data frame doesn't have any of the columns in the cols
vector, you have to select the columns of the NA
tibble data frame that are not present in the original data frame. You can do that by using !select(matches(colnames(diamonds)))
.
cols <- c("grade", "price", "status")
matrix(ncol = length(cols)) |>
as_tibble(
.name_repair = ~ cols
) |>
mutate(
grade = as.factor(grade)
)
diamonds |>
mutate(
matrix(
ncol = length(cols)
) |>
as_tibble(
.name_repair = ~cols
) |>
## if you want to interpret the grade as `factor` type...
mutate(
grade = as.factor(grade)
) |>
## select columns that are not present in the original data frame
dplyr::select(
!matches(colnames(diamonds))
)
)
## # A tibble: 53,940 × 12
## carat cut color clarity depth table price x y z grade status
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <lgl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 NA NA
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 NA NA
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NA NA
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 NA NA
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NA NA
## 6 0.24 Very Go… J VVS2 62.8 57 336 3.94 3.96 2.48 NA NA
## 7 0.24 Very Go… I VVS1 62.3 57 336 3.95 3.98 2.47 NA NA
## 8 0.26 Very Go… H SI1 61.9 55 337 4.07 4.11 2.53 NA NA
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NA NA
## 10 0.23 Very Go… H VS1 59.4 61 338 4 4.05 2.39 NA NA
## # … with 53,930 more rows
CodePudding user response:
df <- data.frame(grade = c("A", "B", "C"),
score = c(1, 2, 3))
cols <- c('grade', 'score', 'status')
for (i in cols){
if (!(i %in% colnames(df))){
df[i] <- NA
}
}
> df
grade score status
1 A 1 NA
2 B 2 NA
3 C 3 NA