I have a vector of column names (262 of them) and a data frame with the same column names, but missing several of the columns (so rather than 262 columns, I have 250 columns).
Is there a way to set the columns of the data frame to the vector containing the complete set of column names, and for the names which there is no data for - generate a column of NAs?
For example, if I have a vector and data frame that looks like:
column_names <- c("col1","col2","col3","col4","col5","col6","col7")
df1 <- data.frame(c(20,30,10,20,45),c(1,10,2,45,2),c(9,10,22,11,12),c(87,7,7,7,14),c(89,90,99,12,1))
names(df1) <-c("col1","col2","col3","col5","col6")
# col1 col2 col3 col5 col6
# 20 1 9 87 89
# 30 10 10 7 90
# 10 2 22 7 99
# 20 45 11 7 12
# 45 2 12 14 1
I'd like to generate one that looks like:
df2 <- data.frame(c(20,30,10,20,45),c(1,10,2,45,2),c(9,10,22,11,12),c("NA","NA","NA","NA","NA"),c(87,7,7,7,14),c(89,90,99,12,1),c("NA","NA","NA","NA","NA"))
names(df2) <-column_names
# col1 col2 col3 col4 col5 col6 col7
# 20 1 9 NA 87 89 NA
# 30 10 10 NA 7 90 NA
# 10 2 22 NA 7 99 NA
# 20 45 11 NA 7 12 NA
# 45 2 12 NA 14 1 NA
CodePudding user response:
You could use setdiff
to get the missing column names and then add new columns.
column_names <- c("col1","col2","col3","col4","col5","col6","col7")
df <- data.frame(col1 = c(20,30,10,20,45),col2=c(1,10,2,45,2),col4=c(9,10,22,11,12),col5=c(87,7,7,7,14),col6=c(89,90,99,12,1))serdi(colnames(df),column_names)
missing_cols <- setdiff(column_names,colnames(df))
df[missing_cols] <- NA
df <- df[column_names]
df
col1 col2 col3 col4 col5 col6 col7
1 20 1 NA 9 87 89 NA
2 30 10 NA 10 7 90 NA
3 10 2 NA 22 7 99 NA
4 20 45 NA 11 7 12 NA
5 45 2 NA 12 14 1. NA
CodePudding user response:
You can try
cbind(df1, lapply(
setdiff(column_names, names(df1)),
\(x) setNames(data.frame(NA), x)
))[column_names]
# col1 col2 col3 col4 col5 col6 col7
# 1 20 1 9 NA 87 89 NA
# 2 30 10 10 NA 7 90 NA
# 3 10 2 22 NA 7 99 NA
# 4 20 45 11 NA 7 12 NA
# 5 45 2 12 NA 14 1 NA
CodePudding user response:
I came up with this vectorized solution:
## original number of columns in `df1`
nc <- length(df1)
## augment `df1` with an extra column of NA
df1$na <- NA
## map all columns not in `df1` to this extra NA column
df2 <- df1[match(column_names, names(df1)[1:nc], nomatch = nc 1)]
## fix names to desired ones
names(df2) <- column_names
## see the result
df2
# col1 col2 col3 col4 col5 col6 col7
#1 20 1 9 NA 87 89 NA
#2 30 10 10 NA 7 90 NA
#3 10 2 22 NA 7 99 NA
#4 20 45 11 NA 7 12 NA
#5 45 2 12 NA 14 1 NA
Here, I treat data.frames as named lists and use []
for all indexing operations. The same idea works for named lists and named vectors, too.