R: How to locate a column in a large dataframe by using information from another dataframe with less

Time:09-17

I have a data frame (A) with a column containing some information. A larger data frame (B) has a column with similar information, and I need to find which column in B contains the same data as the column in A. Because dataframeB is large, looking through it manually to identify the column would be time-consuming. Is there a way to use the values in the column 'some_info' of dataframeA to find the corresponding column in dataframeB?


dataframeA <- data.frame(some_info = c("a","b","c","d","e") )

dataframeB <- data.frame(id = 1:8, column_to_be_identified = c("a","f","b","c","g", "d","h", "e"), "column_almost_similar_but_not_quite" =c("a","f","b","c","g", "3","h", "e")  )

Basically: is it possible to write a function (or something similar) that looks through dataframeB and detects the column(s) that contain exactly the information from the column in dataframeA?

Thanks a lot in advance!

CodePudding user response:

If I understand correctly and you just want to receive the column name:

dataframeA <- data.frame(some_info = c("a","b","c","d","e") )
dataframeB <- data.frame(id = 1:8, 
                         column_to_be_identified = c("a","f","b","c","g", "d","h", "e"),
                         column_almost_similar_but_not_quite = c("a","f","b","c","g", "3","h", "e")  )


relevant_column_name <- names(
  which(
    # iterate over all columns
    sapply(dataframeB, function(x) {
      # deduplicate first so %in% searches a smaller vector when the column is large
      x <- unique(x)
      # TRUE only if every value of the target vector occurs in this column
      all(dataframeA$some_info %in% x)
    })))

relevant_column_name
#> [1] "column_to_be_identified"
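
Since the question asks for a function, the same check can be wrapped into a reusable helper. This is only a sketch: the name find_matching_columns and its arguments are hypothetical, and it assumes that "matching" means every value of the target vector occurs somewhere in the column (extra values in the column are allowed). Values are compared as character strings, so factor columns are handled as well.

```r
# Hypothetical helper (name and arguments are illustrative, not from the answer).
# Returns the names of all columns of `df` that contain every value in `values`.
find_matching_columns <- function(values, df) {
  # compare as character so factor and character columns behave the same
  values <- unique(as.character(values))
  hits <- vapply(
    df,
    function(col) all(values %in% unique(as.character(col))),
    logical(1)  # each column must yield exactly one TRUE/FALSE
  )
  names(df)[hits]
}

dataframeA <- data.frame(some_info = c("a", "b", "c", "d", "e"))
dataframeB <- data.frame(
  id = 1:8,
  column_to_be_identified = c("a", "f", "b", "c", "g", "d", "h", "e"),
  column_almost_similar_but_not_quite = c("a", "f", "b", "c", "g", "3", "h", "e")
)

find_matching_columns(dataframeA$some_info, dataframeB)
#> [1] "column_to_be_identified"
```

If no column qualifies, the helper returns character(0), and if several qualify it returns all of their names.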

CodePudding user response:

With select() and where() from dplyr we can do this:

library(dplyr)
dataframeB %>% 
   select(where(~ is.character(.) && 
           all(dataframeA$some_info %in% .))) %>%
   names()
#> [1] "column_to_be_identified"
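
To see why a near-match column is rejected, it can help to list which target values each column is missing. The following base R sketch uses setdiff() for that; the variable name missing_by_column is made up for illustration, and the data frames are the ones from the question.

```r
dataframeA <- data.frame(some_info = c("a", "b", "c", "d", "e"))
dataframeB <- data.frame(
  id = 1:8,
  column_to_be_identified = c("a", "f", "b", "c", "g", "d", "h", "e"),
  column_almost_similar_but_not_quite = c("a", "f", "b", "c", "g", "3", "h", "e")
)

# For every column of dataframeB, collect the target values it does NOT contain.
# (missing_by_column is an illustrative name.)
missing_by_column <- lapply(
  dataframeB,
  function(col) setdiff(dataframeA$some_info, as.character(col))
)

# The matching column is missing nothing ...
missing_by_column$column_to_be_identified
#> character(0)

# ... while the near match lacks "d" (it holds "3" in that position).
missing_by_column$column_almost_similar_but_not_quite
#> [1] "d"
```

A column whose entry is empty contains all of the target values, so names(Filter(function(m) length(m) == 0, missing_by_column)) gives the same result as the approaches above.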