I have three folders on my desktop, each containing 2,000 .txt files named numerically from 1.txt through 2000.txt. I want to use R to create data frames from the contents of the .txt files, where each line of a file is a row in the data frame and each column holds the corresponding file from one of the three folders.
Eg: Data Frame 1:
| Folder 1 | Folder 2 | Folder 3 |
| -------- | -------- | -------- |
| contents of 1.txt from folder 1 | contents of 1.txt from folder 2 | contents of 1.txt from folder 3 |
Data Frame 2:
| Folder 1 | Folder 2 | Folder 3 |
| -------- | -------- | -------- |
| contents of 2.txt from folder 1 | contents of 2.txt from folder 2 | contents of 2.txt from folder 3 |
I was able to create a dataframe for one .txt from one folder using:
setwd('/Users/name/Desktop/foldername')
txt_1 = readLines("1.txt")
df1 = data.frame(txt_1)
How do I iterate through the folder and create separate data frames for each of the 2,000 txt files?
Additionally, how do I add the 2,000 corresponding .txt files from folder 2 and folder 3 as columns 2 and 3 in each of the data frames?
Thank you for the help!
CodePudding user response:
This is how I would approach it, personally. In fact, I do a similar approach to process log files from some automated tasks I run at my workplace.
Most of this could be done without using dplyr and tidyr; however, the tibble object (a variant of a data.frame) has a print method that will prove useful if you are processing large amounts of data [1].
I've included a small description of what the code is doing above each step. I see you're pretty new to Stack Overflow, but I'm not sure what your experience level is with R in general, so please don't be afraid to ask for clarification.
library(dplyr)
library(tidyr)
library(parallel)

parent_dir <- "path_to_your_desktop"
folder <- c("Folder1", "Folder2", "Folder3")

cl <- makeCluster(detectCores() - 1)

AllData <-
  # Make a data frame with the path to the folders
  tibble(folder_path = file.path(parent_dir, folder),
         folder = folder) %>%
  # Return the names of all of the files in the folders.
  # Consider using the `pattern` argument in ?list.files
  # if the files you want to read all have a common naming convention.
  mutate(files = lapply(folder_path,
                        list.files)) %>%
  unnest(cols = 'files') %>%
  # Read the data from each of the files.
  # Whatever function you are using to process your data can
  # replace read.csv. I'm using it just for illustration.
  # Note: you mention a lot of files; I'm using the parallel version
  # of lapply here as it might cut down on the time it takes to read
  # all of your files.
  mutate(data = parLapply(cl,
                          file.path(folder_path, files),
                          read.csv,
                          stringsAsFactors = FALSE),
         # You mentioned you would want to remove data frames that
         # have a different number of rows. Here's a quick way
         # to get the number of rows in each data frame.
         nrows = vapply(data,
                        nrow,
                        numeric(1))) %>%
  select(-folder_path)

# Release the worker processes once the files are read.
stopCluster(cl)

AllData
#> # A tibble: 5 x 4
#> folder files data nrows
#> <chr> <chr> <list> <dbl>
#> 1 Folder1 PointlessData.txt <df[,3] [3 x 3]> 3
#> 2 Folder2 PointlessData.txt <df[,3] [3 x 3]> 3
#> 3 Folder3 PointlessData - Copy (2).txt <df[,3] [3 x 3]> 3
#> 4 Folder3 PointlessData - Copy.txt <df[,3] [3 x 3]> 3
#> 5 Folder3 PointlessData.txt <df[,3] [3 x 3]> 3
Created on 2022-10-27 by the reprex package (v2.0.1)
[1] In a data.frame, columns made up of lists are printed in full. For large amounts of data, this could bring your R session or computer to a halt. In a tibble, all that is printed is an indication of a list element.
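The AllData tibble above keeps everything in one nested table rather than the 2,000 separate wide data frames the question asked for. If you do want one data frame per file number with one column per folder, here is one way to get there. This is a sketch on a toy stand-in for AllData: the `lines` list column (a hypothetical name, not part of the answer's pipeline) holds raw `readLines()` output instead of the parsed `data` column built above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for AllData: one row per (folder, file) pair,
# with each file's lines stored in a list column.
AllData <- tibble(
  folder = rep(c("Folder1", "Folder2", "Folder3"), each = 2),
  files  = rep(c("1.txt", "2.txt"), times = 3),
  lines  = list(c("a", "b"), c("c", "d"),
                c("e", "f"), c("g", "h"),
                c("i", "j"), c("k", "l"))
)

# Unnest so each line is a row, number the lines within each file,
# spread the folders into columns, then split into one data frame
# per file name.
per_file <- AllData %>%
  unnest(cols = "lines") %>%
  group_by(folder, files) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = folder, values_from = lines) %>%
  select(-row) %>%
  split(.$files)

per_file[["1.txt"]]
```

`per_file` is then a named list, so `per_file[["1.txt"]]` gives the wide data frame for 1.txt; a list keeps the 2,000 results manageable compared to 2,000 objects in the global environment.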
CodePudding user response:
You probably want something like this. 2,000 separate data frames all in the global environment seems like too much; putting them in a list would be more manageable. PS: I could not test this since I do not know what your files look like. Also, since you want to skip files with different row counts, I added that in.
library(tidyverse)
loc_1 <- '/Users/name/Desktop/foldername1/'
loc_2 <- '/Users/name/Desktop/foldername2/'
loc_3 <- '/Users/name/Desktop/foldername3/'
df_list <- map(1:2000,
               \(i){
                 df1 <- read_lines(glue::glue("{loc_1}{i}.txt")) |>
                   data.frame()
                 df2 <- read_lines(glue::glue("{loc_2}{i}.txt")) |>
                   data.frame()
                 df3 <- read_lines(glue::glue("{loc_3}{i}.txt")) |>
                   data.frame()
                 if(all(c(nrow(df1), nrow(df2), nrow(df3)) == nrow(df1))){
                   bind_cols(df1, df2, df3)
                 } else NA
               }) |>
  `names<-`(paste0(1:2000, ".txt"))
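Since files with mismatched row counts come back as NA entries in df_list, you may want to drop those before doing anything else. A minimal sketch on a toy stand-in list (the real df_list would have 2,000 entries):

```r
library(purrr)

# Toy stand-in for df_list: data frames where the row counts matched
# across the three folders, NA where they did not.
df_list <- list(`1.txt` = data.frame(a = 1:2, b = 3:4),
                `2.txt` = NA,
                `3.txt` = data.frame(a = 5, b = 6))

# Keep only the entries that are real data frames; names survive,
# so you can still look results up by file name.
kept <- keep(df_list, is.data.frame)

names(kept)
#> [1] "1.txt" "3.txt"
```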
CodePudding user response:
Here's a base R solution. It creates a list of data frames, where the file number corresponds to the list index; cases where the number of lines per file don't match are left NULL.
# Assumes the working directory is the desktop, so the folder names
# below resolve as relative paths.
dfs <- list()
folders <- c("Folder 1", "Folder 2", "Folder 3")
for (i in 1:2000) {
  paths <- file.path(folders, paste0(i, ".txt"))
  names(paths) <- folders
  files <- lapply(paths, readLines)
  if (length(unique(sapply(files, length))) == 1) {
    dfs[[i]] <- data.frame(files)
  }
}
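Afterwards, the skipped file numbers are NULL entries in dfs. You can drop them while keeping track of which file each surviving data frame came from. A small sketch on a toy dfs list standing in for the loop's output:

```r
# Toy stand-in for dfs: NULL marks a file whose line counts
# disagreed across the three folders.
dfs <- list(data.frame(Folder.1 = "a", Folder.2 = "b", Folder.3 = "c"),
            NULL,
            data.frame(Folder.1 = "d", Folder.2 = "e", Folder.3 = "f"))

# Name the entries by file number first, then drop the NULLs;
# Filter() subsets with `[`, so the names are preserved.
names(dfs) <- paste0(seq_along(dfs), ".txt")
dfs <- Filter(Negate(is.null), dfs)

names(dfs)
#> [1] "1.txt" "3.txt"
```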