How to load multiple csv files with different numbers of columns into an R dataframe?-CodePudding

I have lots of csv files which are mostly the same but some have a different number of columns. I want to load named columns and ignore any others. (all files have a header row with column names)

I am hoping I can do something like this so that when a file has a column called 'Temp' it just gets ignored, but I don't know where to put myCols to get that outcome.

fileList <- dir("thedata")
myCols <- c('Time','Place')
df <- do.call(rbind, lapply(fileList, read.csv))

CodePudding user response：

First I created some example data frames to make a list similar to your description.

# Example data frames
df1 <- data.frame(Time = runif(10, 0, 100), Place = letters[1:10])
df2 <- data.frame(Time = runif(10, 0, 100), Place = letters[1:10], Temp = runif(10, 0, 1))
df3 <- data.frame(Time = runif(10, 0, 100), Place = letters[11:20])

# List of data frames
df_l <- list(df1, df2, df3)

Now a possible solution

# Now you can select only the target columns using Map and column names
myCols <- c('Time','Place')
df_l <- Map(function(x){x[,names(x) %in% my_cols]}, df_l)

# now you can simply rbind the data frames (which now all have the same columns)
df <- do.call("rbind", df_l)

head(df)
       Time place
1  6.474527     a
2 78.747293     b
3  9.758012     c
4 79.765140     d
5  9.130967     e
6 34.237917     f

CodePudding user response：

Download xsv (releases) and put it on your path and then if Filenames is a character vector with the filenames run the following code. The file column in the result will contain the filename that each row originally came from. If you don't need that the last line can be just rbindlist(L).

library(data.table)

L <- lapply(paste("xsv select Time,Place", Filenames), fread)
rbindlist(setNames(L, Filenames), idcol = "file")

CodePudding user response：

Not sure about how you would avoid reading the unwanted columns but the plyr package has a really useful function rbind.fill which joins data frames together row-wise and fills with missing values if the column is not available.

all_data <- Map(read.csv, fileList)
df <- plyr::rbind.fill(all_data)

Then you can drop the columns you do not want from df.

As others have mentioned, to speed up reading the data you might want to use read_csv from readr or fread from data.table. As they are both faster than read.csv.