Passing data through sapply in R

I am working with R. I have 120 documents that look like this:

chair  table  hill  block  chain  ball  money  house
    2      4     5      6      7     2      4      5
    1      3     6      1      8     3      9      1
    2      1     1      6      1     8      2      3
    6      4     5      4      2     5      8      4
    5      5     5      5      3     2      6      7

I created a function:

myFunc <- function(x) c(mean = mean(x), n = length(x),
                        SD = sd(x))

Then, I used the sapply function and converted the results to a data frame.

lol3 <- sapply(mydat, myFunc)

lol4 <- as.data.frame(lol3)

How can I pass this through all 120 of my documents and obtain the same output for each one?

CodePudding user response：

You can read your data into R (assuming .csv files) with something like this, which will put the 120 dataframes into a list, dfs.

# pattern is a regular expression (not a glob), so match names ending in .csv
temp <- list.files(pattern="\\.csv$",
                   full.names=TRUE)
dfs <- lapply(temp, read.csv)
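
One optional refinement, sketched here under the assumption that the files are comma-separated with a header row: naming each list element after its file makes it easier to trace a result back to its source. The directory and file names (`demo_dir`, `doc1.csv`, `doc2.csv`) are made up for illustration and are not from the question.

```r
# Hypothetical demo directory with two small CSV files (illustration only)
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(chair = c(2, 1, 2), table = c(4, 3, 1)),
          file.path(demo_dir, "doc1.csv"), row.names = FALSE)
write.csv(data.frame(chair = c(6, 5, 5), table = c(4, 5, 5)),
          file.path(demo_dir, "doc2.csv"), row.names = FALSE)

# List the files; "\\.csv$" is a regular expression for names ending in .csv
temp <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file, then name each data frame after its file (extension dropped)
dfs <- lapply(temp, read.csv)
names(dfs) <- tools::file_path_sans_ext(basename(temp))

names(dfs)
```

With named elements, anything later produced by lapply over dfs carries the same names, so dfs$doc1 and its summary are easy to match up.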

Then, you can use purrr to apply your function to all columns in each dataframe.

library(purrr)  # map() comes from purrr; loading the whole tidyverse also works

purrr::map(dfs, function(x) as.data.frame(map(x, myFunc)))

Or if you want to stick to the apply family, then you can use a combination of sapply and lapply, which produces the same output.

lapply(dfs, FUN = function(x) as.data.frame(sapply(x, myFunc)))

Output

[[1]]
        chair    table     hill
mean 5.707284 2.016096 4.632898
n    5.000000 5.000000 5.000000
SD   3.037238 2.609171 2.891649

[[2]]
        chair    table     hill
mean 4.972276 3.522378 2.779039
n    5.000000 5.000000 5.000000
SD   2.309736 2.402731 1.805293

[[3]]
        chair    table     hill
mean 4.614903 2.994203 3.573117
n    5.000000 5.000000 5.000000
SD   2.341367 2.250936 2.388503

[[4]]
        chair    table     hill
mean 3.962970 5.326495 5.289796
n    5.000000 5.000000 5.000000
SD   3.454892 3.250463 2.261613
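
If a single table is easier to work with than a list of 120 small ones, the per-document summaries can be stacked. The following is only a sketch with made-up data; the names `doc1`, `doc2`, `res`, and `combined` are assumptions introduced for the example.

```r
myFunc <- function(x) c(mean = mean(x), n = length(x), SD = sd(x))

# Two small made-up data frames standing in for the real documents
dfs <- list(
  doc1 = data.frame(chair = c(2, 1, 2, 6, 5), table = c(4, 3, 1, 4, 5)),
  doc2 = data.frame(chair = c(1, 3, 5, 7, 9), table = c(2, 2, 2, 2, 2))
)

# One summary data frame per document, as in the lapply/sapply approach
res <- lapply(dfs, function(x) as.data.frame(sapply(x, myFunc)))

# Stack the summaries, recording the document and the statistic as columns
combined <- do.call(rbind, lapply(names(res), function(nm) {
  data.frame(doc = nm, stat = rownames(res[[nm]]), res[[nm]],
             row.names = NULL)
}))
rownames(combined) <- NULL

combined
```

This assumes every document has the same columns; rbind stops with an error if the column names differ between documents.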

Data and function

dfs <-
  list(
    structure(
      list(
        chair = c(
          8.5190371570643,
          7.96348396944813,
          0.909618176054209,
          6.22358219046146,
          4.92070004786365
        ),
        table = c(
          6.57108870637603,
          1.24985219235532,
          0.31808012444526,
          1.61383197060786,
          0.327624693978578
        ),
        hill = c(
          6.09723530360498,
          4.0752498563379,
          0.514291892526671,
          8.34481998276897,
          4.13289327942766
        )
      ),
      class = "data.frame",
      row.names = c(NA,-5L)
    ),
    structure(
      list(
        chair = c(
          6.7650549269747,
          3.03855412406847,
          4.73554673418403,
          7.83120877831243,
          2.49101754790172
        ),
        table = c(
          0.390065581072122,
          4.98121203482151,
          2.45721989544109,
          6.66204453259706,
          3.12134596821852
        ),
        hill = c(
          1.91045304620638,
          4.58099147421308,
          0.0588874609675258,
          3.53219888708554,
          3.81266354466788
        )
      ),
      class = "data.frame",
      row.names = c(NA,-5L)
    ),
    structure(
      list(
        chair = c(
          6.3361757278908,
          6.55107694189064,
          0.718950896523893,
          4.66502936370671,
          4.80328303738497
        ),
        table = c(
          3.64353043888696,
          0.393067884491757,
          5.97248072689399,
          1.12406565248966,
          3.83787051541731
        ),
        hill = c(
          2.51783188269474,
          1.69069789093919,
          2.80711475084536,
          3.11010800697841,
          7.73983212606981
        )
      ),
      class = "data.frame",
      row.names = c(NA,-5L)
    ),
    structure(
      list(
        chair = c(
          1.01416687178425,
          8.35876755136997,
          6.14014947647229,
          4.20261403801851,
          0.0991506285499781
        ),
        table = c(
          5.58780367858708,
          6.96770576946437,
          8.87186487391591,
          0.143813505303115,
          5.06128515000455
        ),
        hill = c(
          5.23900085524656,
          8.74273954122327,
          5.80283471778966,
          2.78146772808395,
          3.8829374271445
        )
      ),
      class = "data.frame",
      row.names = c(NA,-5L)
    )
  )

myFunc <- function (x)
  c(mean = mean(x),
    n = length(x),
    SD = sd(x))

CodePudding user response：

You need to read in the paths of your documents. I once wrote a Medium article about path-handling functions in R; point 4 there covers how to list the files in a folder recursively.

# let's say all your files end in .txt
paths <- list.files("/to/dir_containing_all_the_files",
                    pattern="\\.txt$",
                    full.names=TRUE, # return paths that include the directory
                    recursive=TRUE)  # also search subfolders for files

# choose the read function that matches your file format:
# let's assume - like your input - they are TAB (?) separated
dfs <- lapply(paths, function(path) read.delim(path, header=TRUE, sep="\t"))

# in case of csv files, use this instead:
# dfs <- lapply(paths, function(path) read.csv(path, header=TRUE))

# take a single file first and adjust the read.delim arguments
# until it is read correctly, then test a few more files!

# dfs should now contain one data frame per file.
# you already know how to process a single data frame,
# so wrap that step in a function:
process_df <- function(df) sapply(df, myFunc)

# apply it over dfs; lapply keeps one summary matrix per file
# (sapply would flatten each matrix into a single column):
process_dfs <- function(dfs) lapply(dfs, process_df)

# then you call:
result <- process_dfs(dfs)
# and see how it is structured:
str(result)
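
To see why the choice between sapply and lapply matters at the outer level, here is a small sketch with made-up data (the two data frames are invented for the example). When every inner result has the same size, sapply simplifies the outer result into one matrix and the per-column layout is lost, whereas lapply keeps one matrix per data frame.

```r
myFunc <- function(x) c(mean = mean(x), n = length(x), SD = sd(x))
process_df <- function(df) sapply(df, myFunc)

# Two made-up data frames, each with 2 columns
dfs <- list(
  data.frame(chair = c(2, 1, 2), table = c(4, 3, 1)),
  data.frame(chair = c(6, 5, 5), table = c(4, 5, 5))
)

# sapply simplifies: each 3 x 2 result is flattened into one column of length 6
flat <- sapply(dfs, process_df)
dim(flat)

# lapply keeps one 3 x 2 matrix (with row and column names) per data frame
kept <- lapply(dfs, process_df)
dim(kept[[1]])
```

If the data frames had different numbers of columns, sapply could not simplify and would return a list anyway, so lapply gives the more predictable structure.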

Alternatively, write the whole procedure for a single path and then apply it over all paths:

process_path <- function(path) {
  df <- read.delim(path, header=TRUE, sep="\t")
  sapply(df, myFunc)
}

# finally, you can write (simplify = FALSE returns a list named by path):
process_paths <- function(paths) sapply(paths, process_path, simplify = FALSE)

# and call:
process_paths(paths)
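
With 120 files, one unreadable file would stop the whole run. A possible safeguard, sketched here as an assumption rather than part of the answer above, is to wrap process_path in tryCatch so that a failing file yields NULL plus a warning. The wrapper name `safe_process_path` and the demo files are hypothetical.

```r
myFunc <- function(x) c(mean = mean(x), n = length(x), SD = sd(x))

process_path <- function(path) {
  df <- read.delim(path, header = TRUE, sep = "\t")
  sapply(df, myFunc)
}

# Hypothetical wrapper: return NULL (with a warning) instead of stopping
safe_process_path <- function(path) {
  tryCatch(process_path(path), error = function(e) {
    warning("skipping ", path, ": ", conditionMessage(e), call. = FALSE)
    NULL
  })
}

# Demo: one valid tab-separated file and one path that does not exist
good <- tempfile(fileext = ".txt")
write.table(data.frame(chair = c(2, 1, 2), table = c(4, 3, 1)),
            good, sep = "\t", row.names = FALSE, quote = FALSE)
paths <- c(good, file.path(tempdir(), "no_such_file.txt"))

results <- lapply(paths, safe_process_path)

# TRUE for files that were processed, FALSE for files that were skipped
ok <- !vapply(results, is.null, logical(1))
ok
```

The failed paths can then be inspected with paths[!ok] and the successful summaries kept with results[ok].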