Home > Software design >  How do I run descriptive statistics in R for each file in a directory and append it to a data frame
How do I run descriptive statistics in R for each file in a directory and append it to a data frame

Time:09-04

I have a folder of files (csv) that have filtered/gated data -- two columns (dihedral angle vs bend angle). It was filtered based upon an individualized min and max for each file.

I need to be able to get at least the mean, median, sd, skewness, and kurtosis for each column of each file and have that data presented as a table. (One line per file in the final product)

I am not familiar with what R packages that maybe suitable for this task, so I was trying to do something simple. I can get it to work for a single file, but I have over 200 files. They will likely be updating, so I'll have to run this multiple times.

module load ccs/container/R/4.1.0
R

library(moments)

files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)

cat("filename","\t","dihedral mean","\t","bend mean","\t","dihedral median","\t","bend median","\t","dh sd","\t","bd sd","\t","dh skew","\t","bd skew","\t","dh kurt","\t","bd kurt","\n")

for (currentFile in files) {
  df <- read.table(fileName[i], header=TRUE)

  z1 <- mean(df$V1)
  z2 <- median(df$V1)
  z3 <- sd(df$V1)
  z4 <- skewness(df$V1)
  z5 <- kurtosis(df$V1)

  z7 <- mean(df$V2)
  z8 <- median(df$V2)
  z9 <- sd(df$V2)
  z10 <- skewness(df$V2)
  z11 <- kurtosis(df$V2)
  
  cat(filename,"\t",z1,"\t",z7,"\t",z2,"\t",z8,"\t",z3,"\t",z9,"\t",z4,"\t",z10,"\t",z5,"\t",z11,"\n")

  write.table(newdata, file=statsFileName[i]))
}

The "first cat line" is the header and labels.

The "for cat line" likely goes "no where," but it is the format that I am trying to achieve.

The "write.table line" is something that I found, but I don't think it may be appropriate for this.

I truly appreciate any help on this. I am not that familiar with R and the examples that I have found do not appear to relate enough to what I trying to do for me to adapt them.

CodePudding user response:

The following computes all statistics the question asks for for each file and writes a table of results to a CSV file.

library(moments)
#
stats <- function(filename, na.rm = TRUE) {
  x <- read.csv(filename)
  xbar <- colMeans(x, na.rm = na.rm)
  med <- apply(x, 2, median, na.rm = na.rm)
  S <- apply(x, 2, sd, na.rm = na.rm)
  skwn <- skewness(x, na.rm = na.rm)
  kurt <- kurtosis(x, na.rm = na.rm)
  #
  # return a data.frame, it will 
  # make the code simpler further on
  out <- data.frame(
    filename = filename, 
    dihedral.mean = xbar[1],
    bend.mean = xbar[2],
    dihedral.median = med[1],
    bend.median = med[2],
    dihedral.sd = S[1],
    bend.sd = S[2],
    dihedral.skewness = skwn[1],
    bend.skewness = skwn[2],
    dihedral.kurtosis = kurt[1],
    bend.kurtosis = kurt[2]
  )
  row.names(out) <- NULL
  out
}

statsFileName <- "statsfile.txt"

#files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
files <- list.files("~/Temp", "^t.*\\.csv$")

newdata <- lapply(files, stats)
newdata <- do.call(rbind, newdata)

write.csv(newdata, file = statsFileName, row.names = FALSE)

CodePudding user response:

This solution uses dplyr to summarise each file, combines the summaries into a single dataframe, then writes the results to a csv file.

library(moments)
library(dplyr)

### Create dummy csv files for reproducibility ###
if(!dir.exists("./data/")) dir.create("./data/")
for(i in 1:200){
    write.csv(data.frame(V1 = runif(100), V2 = runif(100)),
              file = paste0("./data/file_", i, ".csv"),
              row.names = FALSE)
}

### Summarise files ###
files <- list.files("./data", full.names = TRUE)
all_results <- vector("list", length(files)) # results placeholder

# Loop that calculates summary statistics
for (i in 1:length(files)) {
    currentFile <- files[i]
    df <- read.csv(file = currentFile, header=TRUE)
    result <- df %>% summarise_all(list(mean = mean, median = median, 
                              sd = sd, skew = skewness, kur = kurtosis))%>% 
        mutate(file = currentFile) %>% # add filename to the result
        select(file, everything()) # reorder 
    all_results[[i]] <- result
}

# Combine results into a single df
final_table <- bind_rows(all_results)

# write file
write.csv(final_table, "results.csv", row.names = FALSE)
  • Related