How to calculate the average of the same column (with same name) in 100s different csv files with pa

Time:01-11

I have a bunch of csv files that are structured like this:

df <- data.frame (first_column  = c(3, 2, 6, 7),
                  second_column = c(7, 5, 1, 8))

All the csv files have a name like

"type1_1.csv"
"type1_2.csv"
...
"type2_1.csv"
"type2_2.csv"
...

Each of these CSV files has a first_column and a second_column. What I want is to create a new data frame that looks like this:

# name        meanofsecond_column
# type1_1     5.25
# ...

So far, I have been writing out each file individually:

type1_1 <- read_csv("type1_1.csv")
type1_1mean <- mean(type1_1$second_column)
...
df <- data.frame (name  = c("type1_1", "type1_2"...),
                  meanofsecondcolumn = c(type1_1mean, type1_2mean...))

However, since there are more than 100 csv files, this method is not very efficient or clean. How can I make it more condensed?

CodePudding user response:

# path where your csv files are (here current working directory)
CSV_FOLDER <- "."

# list all csv files in given directory
# second parameter is a regex meaning ends with .csv
# third parameter makes the function return file names with their path
csv_files <- list.files(CSV_FOLDER, "\\.csv$", full.names=TRUE)

# apply given function on each file and collect results in a list
res <- lapply(csv_files, function(csv_file) {
  # read current file
  tmp <- read.csv(csv_file)

  # build a data.frame from filename (without path or extension) and mean of second column
  return(data.frame(
    name = tools::file_path_sans_ext(basename(csv_file)),
    meanofsecondcolumn = mean(tmp$second_column)
  ))
})

# rbind all single line data.frames in a single data.frame
res <- do.call(rbind, res)
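
To try this idea without touching real data, here is a minimal sketch that first writes two small CSV files into a temporary folder and then computes the means. It is a vapply-based variant rather than the exact code above; the folder name csv_mean_demo and the contents of the second file are made up for illustration.

```r
# Hypothetical demo data: two small CSV files in a temporary folder
demo_dir <- file.path(tempdir(), "csv_mean_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(first_column  = c(3, 2, 6, 7),
                     second_column = c(7, 5, 1, 8)),
          file.path(demo_dir, "type1_1.csv"), row.names = FALSE)
write.csv(data.frame(first_column  = c(1, 2),
                     second_column = c(10, 20)),
          file.path(demo_dir, "type1_2.csv"), row.names = FALSE)

# list the csv files with their full path
csv_files <- list.files(demo_dir, "\\.csv$", full.names = TRUE)

# vapply returns a plain numeric vector with one mean per file
means <- vapply(csv_files, function(csv_file) {
  mean(read.csv(csv_file)$second_column)
}, numeric(1))

# one row per file; the name is the file name without path or extension
res <- data.frame(
  name = tools::file_path_sans_ext(basename(csv_files)),
  meanofsecondcolumn = unname(means)
)
print(res)
```

With these two demo files, type1_1 gets a mean of 5.25 and type1_2 a mean of 15. Because the result is built in one step from two vectors, no rbind is needed afterwards.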

CodePudding user response:

Using map_dfr() from purrr:

library(purrr)
library(dplyr)
library(stringr)
files <- list.files(path = "/path/to/folder", pattern = "\\.csv$", 
      full.names = TRUE)
nm <- str_remove(basename(files), "\\.csv$")
map_dfr(setNames(files, nm), ~ read.csv(.x) %>%
          summarise(secondcolumn = mean(second_column,
             na.rm = TRUE)), .id = "name")
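
As a side note, map_dfr() has been marked as superseded since purrr 1.0.0 in favour of map() followed by list_rbind(). The sketch below shows that form, assuming purrr 1.0.0 or later and dplyr are installed; the temporary folder csv_purrr_demo and the demo file contents are made up for illustration.

```r
library(purrr)
library(dplyr)

# Hypothetical demo data: two small CSV files in a temporary folder
demo_dir <- file.path(tempdir(), "csv_purrr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(first_column  = c(3, 2, 6, 7),
                     second_column = c(7, 5, 1, 8)),
          file.path(demo_dir, "type1_1.csv"), row.names = FALSE)
write.csv(data.frame(first_column  = c(1, 2),
                     second_column = c(10, 20)),
          file.path(demo_dir, "type1_2.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
nm <- tools::file_path_sans_ext(basename(files))

# map() returns a named list of one-row data frames;
# list_rbind() stacks them and stores the list names in the "name" column
res <- setNames(files, nm) %>%
  map(~ read.csv(.x) %>%
        summarise(meanofsecondcolumn = mean(second_column, na.rm = TRUE))) %>%
  list_rbind(names_to = "name")
print(res)
```

The result has the same shape as with map_dfr(): one row per file, with the file name (minus extension) in name.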