I have three folders on my desktop, each containing 2,000 .txt files named numerically from 1.txt through 2000.txt. I want to use R to create data frames from the contents of the .txt files, where each line of a file is a row in the data frame and each column holds the corresponding file from one of the three folders.
Eg: Data Frame 1:
| Folder 1 | Folder 2 | Folder 3 |
| -------- | -------- | -------- |
| contents of 1.txt from folder 1 | contents of 1.txt from folder 2 | contents of 1.txt from folder 3 |
Data Frame 2:
| Folder 1 | Folder 2 | Folder 3 |
| -------- | -------- | -------- |
| contents of 2.txt from folder 1 | contents of 2.txt from folder 2 | contents of 2.txt from folder 3 |
I was able to create a dataframe for one .txt from one folder using:
setwd('/Users/name/Desktop/foldername')
txt_1 = readLines("1.txt")
df1 = data.frame(txt_1)
How do I iterate through the folder and create separate data frames for each of the 2,000 txt files?
Additionally, how do I add the 2,000 corresponding .txt files from folder 2 and folder 3 as columns 2 and 3 in each of the data frames?
Thank you for the help!
CodePudding user response:
This is how I would approach it, personally. In fact, I do a similar approach to process log files from some automated tasks I run at my workplace.
Most of this could be done without using dplyr and tidyr; however, the tibble object (a variant of a data.frame) has a print method that will prove useful if you are processing large amounts of data [1].
I've included a small description of what the code is doing above each step. I see you're pretty new to Stack Overflow, but I'm not sure what your experience level is with R in general, so please don't be afraid to ask for clarification.
library(dplyr)
library(tidyr)
library(parallel)

parent_dir <- "path_to_your_desktop"
folder <- c("Folder1", "Folder2", "Folder3")

cl <- makeCluster(detectCores() - 1)

AllData <-
  # Make a data frame with the path to the folders
  tibble(folder_path = file.path(parent_dir, folder),
         folder = folder) %>%
  # Return the names of all of the files in the folders.
  # Consider using the `pattern` argument in ?list.files
  # if the files you want to read all have a common naming convention.
  mutate(files = lapply(folder_path,
                        list.files)) %>%
  unnest(cols = 'files') %>%
  # Read the data from each of the files.
  # Whatever function you are using to process your data can
  # replace read.csv. I'm using it just for illustration.
  # Note: you mention a lot of files; I'm using the parallel version
  # of lapply here as it might cut down on the time it takes to read
  # all of your files.
  mutate(data = parLapply(cl,
                          file.path(folder_path, files),
                          read.csv,
                          stringsAsFactors = FALSE),
         # You mentioned you would want to remove data frames that
         # have a different number of rows. Here's a quick way
         # to get the number of rows in each data frame.
         nrows = vapply(data,
                        nrow,
                        numeric(1))) %>%
  select(-folder_path)

# Release the worker processes once the files are read.
stopCluster(cl)

AllData
#> # A tibble: 5 x 4
#> folder files data nrows
#> <chr> <chr> <list> <dbl>
#> 1 Folder1 PointlessData.txt <df[,3] [3 x 3]> 3
#> 2 Folder2 PointlessData.txt <df[,3] [3 x 3]> 3
#> 3 Folder3 PointlessData - Copy (2).txt <df[,3] [3 x 3]> 3
#> 4 Folder3 PointlessData - Copy.txt <df[,3] [3 x 3]> 3
#> 5 Folder3 PointlessData.txt <df[,3] [3 x 3]> 3
Created on 2022-10-27 by the reprex package (v2.0.1)
[1] In a data.frame, columns made up of lists are printed in full. For large amounts of data, this could bring your R session or computer to a halt. In a tibble, all that is printed is an indication of a list element.
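The AllData tibble above keeps everything in one nested table rather than the 2,000 separate wide data frames the question asked for. If you do want one data frame per file number with one column per folder, here is one way to get there. This is a sketch on a toy stand-in for AllData: the `lines` list column (a hypothetical name, not part of the answer's pipeline) holds raw `readLines()` output instead of the parsed `data` column built above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for AllData: one row per (folder, file) pair,
# with each file's lines stored in a list column.
AllData <- tibble(
  folder = rep(c("Folder1", "Folder2", "Folder3"), each = 2),
  files  = rep(c("1.txt", "2.txt"), times = 3),
  lines  = list(c("a", "b"), c("c", "d"),
                c("e", "f"), c("g", "h"),
                c("i", "j"), c("k", "l"))
)

# Unnest so each line is a row, number the lines within each file,
# spread the folders into columns, then split into one data frame
# per file name.
per_file <- AllData %>%
  unnest(cols = "lines") %>%
  group_by(folder, files) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = folder, values_from = lines) %>%
  select(-row) %>%
  split(.$files)

per_file[["1.txt"]]
```

`per_file` is then a named list, so `per_file[["1.txt"]]` gives the wide data frame for 1.txt; a list keeps the 2,000 results manageable compared to 2,000 objects in the global environment.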
CodePudding user response:
You probably want something like this. 2,000 separate data frames all in the global environment seems like too much; putting them in a list would be more manageable. PS: I could not test this since I do not know what your files look like. Also, since you want to skip files with different row counts, I added that in.
library(tidyverse)
loc_1 <- '/Users/name/Desktop/foldername1/'
loc_2 <- '/Users/name/Desktop/foldername2/'
loc_3 <- '/Users/name/Desktop/foldername3/'
df_list <- map(1:2000,
               \(i){
                 df1 <- read_lines(glue::glue("{loc_1}{i}.txt")) |>
                   data.frame()
                 df2 <- read_lines(glue::glue("{loc_2}{i}.txt")) |>
                   data.frame()
                 df3 <- read_lines(glue::glue("{loc_3}{i}.txt")) |>
                   data.frame()
                 if(all(c(nrow(df1), nrow(df2), nrow(df3)) == nrow(df1))){
                   bind_cols(df1, df2, df3)
                 } else NA
               }) |>
  `names<-`(paste0(1:2000, ".txt"))
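Since files with mismatched row counts come back as NA entries in df_list, you may want to drop those before doing anything else. A minimal sketch on a toy stand-in list (the real df_list would have 2,000 entries):

```r
library(purrr)

# Toy stand-in for df_list: data frames where the row counts matched
# across the three folders, NA where they did not.
df_list <- list(`1.txt` = data.frame(a = 1:2, b = 3:4),
                `2.txt` = NA,
                `3.txt` = data.frame(a = 5, b = 6))

# Keep only the entries that are real data frames; names survive,
# so you can still look results up by file name.
kept <- keep(df_list, is.data.frame)

names(kept)
#> [1] "1.txt" "3.txt"
```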
CodePudding user response:
Here's a base R solution. It creates a list of data frames, where the file number corresponds to the list index; cases where the number of lines per file don't match are left NULL.
# Assumes the working directory is the desktop, so the folder names
# below resolve as relative paths.
dfs <- list()
folders <- c("Folder 1", "Folder 2", "Folder 3")
for (i in 1:2000) {
  paths <- file.path(folders, paste0(i, ".txt"))
  names(paths) <- folders
  files <- lapply(paths, readLines)
  if (length(unique(sapply(files, length))) == 1) {
    dfs[[i]] <- data.frame(files)
  }
}
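Afterwards, the skipped file numbers are NULL entries in dfs. You can drop them while keeping track of which file each surviving data frame came from. A small sketch on a toy dfs list standing in for the loop's output:

```r
# Toy stand-in for dfs: NULL marks a file whose line counts
# disagreed across the three folders.
dfs <- list(data.frame(Folder.1 = "a", Folder.2 = "b", Folder.3 = "c"),
            NULL,
            data.frame(Folder.1 = "d", Folder.2 = "e", Folder.3 = "f"))

# Name the entries by file number first, then drop the NULLs;
# Filter() subsets with `[`, so the names are preserved.
names(dfs) <- paste0(seq_along(dfs), ".txt")
dfs <- Filter(Negate(is.null), dfs)

names(dfs)
#> [1] "1.txt" "3.txt"
```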