Using R, I'm trying to search many csv files for columns that contain a specific folder name. The data files will always contain two column names PATIENT_ID, EVENT_NAME and then the actual data in many other columns.
The problem is that I don't know beforehand what the other column names are going to be. So there are many different column names and the output should be in a different structure.
So for example, the input file can be like this:
PATIENT_ID EVENT_NAME bp bpfile result
PAT_001 event_1 78 /files_dir/bp.pdf NEG
PAT_002 event_1 65 NA POS
PAT_003 event_1 71 /files_dir/document.pdf POS
PAT_004 event_2 82
PAT_005 event_2 79 /files_dir/bla.jpg /files_dir/report.pdf
And then I want to search all column that contain the phrase files_dir
and create a new dataframe like this
PATIENT_ID EVENT_NAME var_name file_name
PAT_001 event_1 bpfile /files_dir/bp.pdf
PAT_003 event_1 bpfile /files_dir/document.pdf
PAT_005 event_2 bpfile /files_dir/bla.jpg
PAT_005 event_2 result /files_dir/report.pdf
So three bpfile
values and one result
value contain a filename.
Here is the code I've got so far:
# create newdata frame for output
df_output <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df_output) <- c("PATIENT_ID","EVENT_NAME","var_name","file_name")
# input data, test data
#mydata <- read.csv("patientct.csv", sep = ',', fileEncoding = 'UTF-8-BOM')
mydata <- data.frame(
PATIENT_ID = c('PAT_001', 'PAT_002', 'PAT_003', 'PAT_004', 'PAT_005'),
EVENT_NAME = c('event_1', 'event_1', 'event_1', 'event_2', 'event_2'),
bp = c(78, 65, 71, 82, 79),
bpfile = c('/files_dir/bp.pdf', NA, '/files_dir/documnet.pdf', '', '/files_dir/bla.jpg'),
result = c('NEG', 'POS', 'POS', '', '/files_dir/report.pdf')
)
# iterate all columns
for (colnam in colnames(mydata)) {
print(colnam)
if (!is.element(colnam, c("PATIENT_ID","EVENT_NAME")) )
{
# ???
count_file <- sum(grepl("files_dir", mydata[colnam]))
}
}
# output result
write.csv(df_output, "./outputfile_reslts.csv", row.names = TRUE)
I think there needs to be some pivot action, sort of. But I'm not that familiar with R, so I have no idea how to approach this. Any ideas?
CodePudding user response:
I actually used the idea you had and just used a pivot, or I suppose gather()
from tidyr. I have three steps, first step is I converted any factor columns to character (At least for me it will throw out a warning otherwise). My second step was to gather all columns except PATIENT_ID and EVENT_NAME. Then the third step is to filter to only the rows that have pdf or jpg in it. I'm not sure if this is precisely what you need but it might work:
library(tidyr)
library(dplyr)
mydata%>%
mutate_if(is.factor, as.character)%>%
gather("var_name", "file_name", -PATIENT_ID,-EVENT_NAME)%>%
filter(grepl("pdf|jpg", file_name))
Best of luck to you, I hope this helps!
CodePudding user response:
It sounds like you could also use a way to iterate over the numerous csv files. I suggest you can build on Silentdevildoll's solution as follows. We will turn Silentdevildoll's code into a function. Notice I also added a field for the source file.
find_fd <- function(csv, source_name){
tbl <- csv%>%
mutate_if(is.factor, as.character)%>%
gather("var_name", "file_name", -PATIENT_ID,-EVENT_NAME)%>%
filter(grepl("files_dir", file_name)) %>%
mutate(source = source_name)
return(tbl)
}
Then we create a function to read the files from your directory. Defaults to the active directory and assumes only csv files. For each file it calls find_fd().
search_csv <- function(dir = getwd()){
files <- list.files(dir)
results <- data.frame()
for (f in files){
file <- read.csv(f)
t <- find_fd(csv = file, source_name = f)
results <- rbind(results, t)
}
return(results)
}
I created two csv files based on the following:
ex1 <- data.frame(
PATIENT_ID = c('PAT_001', 'PAT_002', 'PAT_003', 'PAT_004', 'PAT_005'),
EVENT_NAME = c('event_1', 'event_1', 'event_1', 'event_2', 'event_2'),
bp = c(78, 65, 71, 82, 79),
bpfile = c('/files_dir/bp.pdf', NA, '/files_dir/documnet.pdf', '', '/files_dir/bla.jpg'),
result = c('NEG', 'POS', 'POS', '', '/files_dir/report.pdf')
)
ex2 <- data.frame(
PATIENT_ID = c('PAT_011', 'PAT_012', 'PAT_013', 'PAT_014', 'PAT_015'),
EVENT_NAME = c('event_3', 'event_1', 'event_3', 'event_2', 'event_3'),
bp = c(78, 65, 71, 82, 79),
bpfile = c(NA, '/files_dir/bp.pdf', '/files_dir/documnet.pdf', '', '/files_dir/bla.jpg'),
other1 = c('NEG', 'POS', 'POS', '/files_dir/report.pdf','' )
)
> myresults <- search_csv()
> myresults
PATIENT_ID EVENT_NAME var_name file_name source
1 PAT_001 event_1 bpfile /files_dir/bp.pdf ex1.csv
2 PAT_003 event_1 bpfile /files_dir/documnet.pdf ex1.csv
3 PAT_005 event_2 bpfile /files_dir/bla.jpg ex1.csv
4 PAT_005 event_2 result /files_dir/report.pdf ex1.csv
5 PAT_012 event_1 bpfile /files_dir/bp.pdf ex2.csv
6 PAT_013 event_3 bpfile /files_dir/documnet.pdf ex2.csv
7 PAT_015 event_3 bpfile /files_dir/bla.jpg ex2.csv
8 PAT_014 event_2 result /files_dir/report.pdf ex2.csv