I have a folder of files (csv) that have filtered/gated data -- two columns (dihedral angle vs bend angle). It was filtered based upon an individualized min and max for each file.
I need to be able to get at least the mean, median, sd, skewness, and kurtosis for each column of each file and have that data presented as a table. (One line per file in the final product)
I am not familiar with what R packages that maybe suitable for this task, so I was trying to do something simple. I can get it to work for a single file, but I have over 200 files. They will likely be updating, so I'll have to run this multiple times.
module load ccs/container/R/4.1.0
R
library(moments)
files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
cat("filename","\t","dihedral mean","\t","bend mean","\t","dihedral median","\t","bend median","\t","dh sd","\t","bd sd","\t","dh skew","\t","bd skew","\t","dh kurt","\t","bd kurt","\n")
for (currentFile in files) {
df <- read.table(fileName[i], header=TRUE)
z1 <- mean(df$V1)
z2 <- median(df$V1)
z3 <- sd(df$V1)
z4 <- skewness(df$V1)
z5 <- kurtosis(df$V1)
z7 <- mean(df$V2)
z8 <- median(df$V2)
z9 <- sd(df$V2)
z10 <- skewness(df$V2)
z11 <- kurtosis(df$V2)
cat(filename,"\t",z1,"\t",z7,"\t",z2,"\t",z8,"\t",z3,"\t",z9,"\t",z4,"\t",z10,"\t",z5,"\t",z11,"\n")
write.table(newdata, file=statsFileName[i]))
}
The "first cat line" is the header and labels.
The "for cat line" likely goes "no where," but it is the format that I am trying to achieve.
The "write.table line" is something that I found, but I don't think it may be appropriate for this.
I truly appreciate any help on this. I am not that familiar with R and the examples that I have found do not appear to relate enough to what I trying to do for me to adapt them.
CodePudding user response:
The following computes all statistics the question asks for for each file and writes a table of results to a CSV file.
library(moments)
#
stats <- function(filename, na.rm = TRUE) {
x <- read.csv(filename)
xbar <- colMeans(x, na.rm = na.rm)
med <- apply(x, 2, median, na.rm = na.rm)
S <- apply(x, 2, sd, na.rm = na.rm)
skwn <- skewness(x, na.rm = na.rm)
kurt <- kurtosis(x, na.rm = na.rm)
#
# return a data.frame, it will
# make the code simpler further on
out <- data.frame(
filename = filename,
dihedral.mean = xbar[1],
bend.mean = xbar[2],
dihedral.median = med[1],
bend.median = med[2],
dihedral.sd = S[1],
bend.sd = S[2],
dihedral.skewness = skwn[1],
bend.skewness = skwn[2],
dihedral.kurtosis = kurt[1],
bend.kurtosis = kurt[2]
)
row.names(out) <- NULL
out
}
statsFileName <- "statsfile.txt"
#files <- list.files("/mnt/gpfs2_4m/scratch/username/fs_scripts/foldedstart_*", pattern="*.csv", recursive=TRUE, full.names=TRUE)
files <- list.files("~/Temp", "^t.*\\.csv$")
newdata <- lapply(files, stats)
newdata <- do.call(rbind, newdata)
write.csv(newdata, file = statsFileName, row.names = FALSE)
CodePudding user response:
This solution uses dplyr to summarise each file, combines the summaries into a single dataframe, then writes the results to a csv file.
library(moments)
library(dplyr)
### Create dummy csv files for reproducibility ###
if(!dir.exists("./data/")) dir.create("./data/")
for(i in 1:200){
write.csv(data.frame(V1 = runif(100), V2 = runif(100)),
file = paste0("./data/file_", i, ".csv"),
row.names = FALSE)
}
### Summarise files ###
files <- list.files("./data", full.names = TRUE)
all_results <- vector("list", length(files)) # results placeholder
# Loop that calculates summary statistics
for (i in 1:length(files)) {
currentFile <- files[i]
df <- read.csv(file = currentFile, header=TRUE)
result <- df %>% summarise_all(list(mean = mean, median = median,
sd = sd, skew = skewness, kur = kurtosis))%>%
mutate(file = currentFile) %>% # add filename to the result
select(file, everything()) # reorder
all_results[[i]] <- result
}
# Combine results into a single df
final_table <- bind_rows(all_results)
# write file
write.csv(final_table, "results.csv", row.names = FALSE)