Is there a way to match the columns of a data frame with a vector of complete column names?-CodePudding

I have a vector of column names (262 of them) and a data frame with the same column names, but missing several of the columns (so rather than 262 columns, I have 250 columns).

Is there a way to set the columns of the data frame to the vector containing the complete set of column names, and for the names which there is no data for - generate a column of NAs?

For example, if I have a vector and data frame that looks like:

column_names <- c("col1","col2","col3","col4","col5","col6","col7")

df1 <- data.frame(c(20,30,10,20,45),c(1,10,2,45,2),c(9,10,22,11,12),c(87,7,7,7,14),c(89,90,99,12,1))
names(df1) <-c("col1","col2","col3","col5","col6")

# col1 col2 col3 col5 col6
# 20   1    9    87   89
# 30   10   10   7    90
# 10   2    22   7    99
# 20   45   11   7    12
# 45   2    12   14   1

I'd like to generate one that looks like:

df2 <- data.frame(c(20,30,10,20,45),c(1,10,2,45,2),c(9,10,22,11,12),c("NA","NA","NA","NA","NA"),c(87,7,7,7,14),c(89,90,99,12,1),c("NA","NA","NA","NA","NA"))
names(df2) <-column_names

# col1 col2 col3 col4 col5 col6 col7
# 20   1    9    NA   87   89   NA
# 30   10   10   NA   7    90   NA
# 10   2    22   NA   7    99   NA
# 20   45   11   NA   7    12   NA
# 45   2    12   NA   14   1    NA

CodePudding user response：

You could use setdiff to get the missing column names and then add new columns.

column_names <- c("col1","col2","col3","col4","col5","col6","col7")

df <- data.frame(col1 = c(20,30,10,20,45),col2=c(1,10,2,45,2),col4=c(9,10,22,11,12),col5=c(87,7,7,7,14),col6=c(89,90,99,12,1))serdi(colnames(df),column_names)

missing_cols <- setdiff(column_names,colnames(df))
df[missing_cols] <- NA

df <- df[column_names]

df

  col1 col2 col3 col4 col5 col6 col7
1   20    1   NA    9   87   89   NA
2   30   10   NA   10    7   90   NA
3   10    2   NA   22    7   99   NA
4   20   45   NA   11    7   12   NA
5   45    2   NA   12   14    1.  NA

CodePudding user response：

You can try

cbind(df1, lapply(
  setdiff(column_names, names(df1)),
  \(x) setNames(data.frame(NA), x)
))[column_names]

#   col1 col2 col3 col4 col5 col6 col7
# 1   20    1    9   NA   87   89   NA
# 2   30   10   10   NA    7   90   NA
# 3   10    2   22   NA    7   99   NA
# 4   20   45   11   NA    7   12   NA
# 5   45    2   12   NA   14    1   NA

CodePudding user response：

I came up with this vectorized solution:

## original number of columns in `df1`
nc <- length(df1)
## augment `df1` with an extra column of NA
df1$na <- NA
## map all columns not in `df1` to this extra NA column
df2 <- df1[match(column_names, names(df1)[1:nc], nomatch = nc   1)]
## fix names to desired ones
names(df2) <- column_names
## see the result
df2
#  col1 col2 col3 col4 col5 col6 col7
#1   20    1    9   NA   87   89   NA
#2   30   10   10   NA    7   90   NA
#3   10    2   22   NA    7   99   NA
#4   20   45   11   NA    7   12   NA
#5   45    2   12   NA   14    1   NA

Here, I treat data.frames as named lists and use [] for all indexing operations. The same idea works for named lists and named vectors, too.