Home > Software design >  Combining multiple excel files in R, but with slightly different column names
Combining multiple excel files in R, but with slightly different column names

Time:06-22

So I have 2 different surveys that I'm working with. For standardized survey #1, I have hundreds of different excel files that I combined using the codes below (the columns/questions are the same in each excel/survey file).

my_files <- list.files(pattern = "*.xls")

Survey <- lapply(my_files, function(i){
  x = read_excel(i, sheet=1, skip = 14)
  x$file = i
  x
})

Survey = do.call("rbind.data.frame", Survey)

This does the trick in easily combining all the different surveys since they have the same column names.

However, my survey #2 is slightly different which makes it difficult to aggregate the files together. Few of the columns start out with similar patterns but are not titled the exact same way, making it difficult to combine. E.g.,

Excel #1:

Participant 4(a). Behavior ABC 4(b). Behavior DEF 4(c). Behavior GHI Course Title
1 4 7 7 Course 1
2 5 5 2 Course 1
3 5 6 3 Course 1
4 6 5 6 Course 1

Excel #2:

Participant 4(a). Behavior XYZ 4(b). Behavior UVW 4(c). Behavior RST Course Title
1 5 4 6 Course 2
2 6 3 7 Course 2
3 6 4 6 Course 2
4 4 5 6 Course 2

For my current purpose, I want to be able to combine dozens or even hundreds of these surveys and show the average score of 4(a). Behavior for all the surveys, for 4(b). Behavior, etc. Is there a simple way to let R know that it can combine columns that start with "4(a)." even if it's named slightly different across files?

CodePudding user response:

Since you are binding the rows of each file, and assuming the columnas are in the same order in each file; you can rename the columns as you read them with read_excel() using the names argument. Then you can bind them as usual.

For example.

Survey <- lapply(my_files, function(i){
  x = read_excel(i, sheet=1, skip = 14, 
                 names = c("participant", "behavior_4_a", "behavior_4_b",   
                           "behavior_4_c", "course_title")
  x$file = i
  x
})

CodePudding user response:

One approach would be to standardize the names. If the corresponding questions are in the same order, this could be as easy as

colnames(df) <- c("these", "are", "your", "column", "names")

You can also select columns by index rather than name:

df[,c(1, 2, 3)]

I would add all the different excel files to a new data frame with the name that you actually want, accessing columns by index.

Writing a function to decide which columns correspond to other columns can be done with dplyr::select_if, however. For instance,

new_df <- dplyr::select_if(\(x){grepl("4(a)",x,fixed=T)}) 

Theres probably a way to write that function using regex, too, if your pattern matching is more complicated. select_if simply has to be passed a predicate

See select_if, \ "lambda" like function, grepl.

I haven't tested this code myself, but this approach should work.

  • Related