Home > Software design >  full join of multiple csv files in R
full join of multiple csv files in R

Time:02-16

i know similar questions have been asked multiple times, but somehow I am stuck very near the solution. Doing some research did not bring me closer. It is possible that, as a beginner in R, i'm asking the wrong questions/searching the wrong keywords. In that case, you are welcome to help.

I have a folder containing multiple .csv files with ";" as delimiter. These files have two columns forming the id. All tables have a different size, e.g. the number of rows varies as does the number of columns.

As an example the data looks like this:

name extension data1 data2 data n
paul ch 7 9 n1
sue nc 0 8 n2
sue ch 6 8 n3
etc.

The data has the following logic: data1 includes all data from 1 sample (sample name: data1) and includes the number of occurencies of names, either nc or ch.

My aim is now to bring together all these tables into a combined data frame.

First I was trying iterating with a for function, but this gave poor results. Later I came accross a tidyverse approach as follows:

files <- dir(path = "~/Documents/data", pattern = "*.csv")

data <- files %>%
  map(read_csv2(paste("~/Documents/data", files, sep = "/"))) %>%
  reduce(full_join)
data

This starts working, but fails with the error:

0sError: Files must all have 31 columns:                                             0s
* File 2 has 34 columns

Here is where I'm stuck. I think the problem occurs when the dataframes are compiled into a list. there it seems to me, that it is a necessity for the dataframes to be of the same extensions. In the examples I found on stack and elsewhere, the example data has always the same length.

Can anybody help me find the mistake?

Best wishes

CodePudding user response:

map syntax is map(list, function). read_csv2(paste("~/Documents/data", files, sep = "/") isn't a function, it's trying to run read_csv2 on all the files, but read_csv2 isn't vecrorized. Change to purrr style lambda syntax as below. I'd also strongly recommend specifying the by columns in your join, especially a full join, to make sure you're getting what you think you are and the result doesn't blow up in size.

files %>% 
  map(~read_csv2(paste("~Documents/data", ., sep = "/")) %>%
  reduce(full_join, by = c("name", "extension"))

If you need to debug more, give yourself a small example and do it one line at a time. Say, start with files2 <- files[1:2] and just work on reading in the first two files. When that works, move on to the join.

  • Related