Search dataframe for columns with values that contains certain string and output new dataframe-CodePudding

Using R, I'm trying to search many csv files for columns that contain a specific folder name. The data files will always contain two column names PATIENT_ID, EVENT_NAME and then the actual data in many other columns.

The problem is that I don't know beforehand what the other column names are going to be. So there are many different column names and the output should be in a different structure.

So for example, the input file can be like this:

PATIENT_ID  EVENT_NAME   bp   bpfile                    result
PAT_001     event_1      78   /files_dir/bp.pdf         NEG
PAT_002     event_1      65   NA                        POS
PAT_003     event_1      71   /files_dir/document.pdf   POS
PAT_004     event_2      82                                
PAT_005     event_2      79   /files_dir/bla.jpg        /files_dir/report.pdf

And then I want to search all column that contain the phrase files_dir and create a new dataframe like this

PATIENT_ID  EVENT_NAME   var_name  file_name
PAT_001     event_1      bpfile    /files_dir/bp.pdf
PAT_003     event_1      bpfile    /files_dir/document.pdf
PAT_005     event_2      bpfile    /files_dir/bla.jpg
PAT_005     event_2      result    /files_dir/report.pdf

So three bpfile values and one result value contain a filename. Here is the code I've got so far:

# create newdata frame for output
df_output <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df_output) <- c("PATIENT_ID","EVENT_NAME","var_name","file_name")

# input data, test data
#mydata <- read.csv("patientct.csv", sep = ',', fileEncoding = 'UTF-8-BOM')
mydata <- data.frame(
  PATIENT_ID = c('PAT_001', 'PAT_002', 'PAT_003', 'PAT_004', 'PAT_005'),
  EVENT_NAME = c('event_1', 'event_1', 'event_1', 'event_2', 'event_2'),
  bp = c(78, 65, 71, 82, 79),
  bpfile = c('/files_dir/bp.pdf', NA, '/files_dir/documnet.pdf', '', '/files_dir/bla.jpg'),
  result = c('NEG', 'POS', 'POS', '', '/files_dir/report.pdf')
  )

# iterate all columns
for (colnam in colnames(mydata)) {
  print(colnam)
  
  if (!is.element(colnam, c("PATIENT_ID","EVENT_NAME")) )
  {
    # ???
    count_file <- sum(grepl("files_dir", mydata[colnam]))
  }
}

# output result
write.csv(df_output, "./outputfile_reslts.csv", row.names = TRUE)

I think there needs to be some pivot action, sort of. But I'm not that familiar with R, so I have no idea how to approach this. Any ideas?

CodePudding user response：

I actually used the idea you had and just used a pivot, or I suppose gather() from tidyr. I have three steps, first step is I converted any factor columns to character (At least for me it will throw out a warning otherwise). My second step was to gather all columns except PATIENT_ID and EVENT_NAME. Then the third step is to filter to only the rows that have pdf or jpg in it. I'm not sure if this is precisely what you need but it might work:

library(tidyr)
library(dplyr)
mydata%>%
  mutate_if(is.factor, as.character)%>%
  gather("var_name", "file_name", -PATIENT_ID,-EVENT_NAME)%>%
  filter(grepl("pdf|jpg", file_name))

Best of luck to you, I hope this helps!

CodePudding user response：

It sounds like you could also use a way to iterate over the numerous csv files. I suggest you can build on Silentdevildoll's solution as follows. We will turn Silentdevildoll's code into a function. Notice I also added a field for the source file.

find_fd <- function(csv, source_name){

    tbl <- csv%>%
        mutate_if(is.factor, as.character)%>%
        gather("var_name", "file_name", -PATIENT_ID,-EVENT_NAME)%>%
        filter(grepl("files_dir", file_name)) %>%
        mutate(source = source_name)
  return(tbl)
}

Then we create a function to read the files from your directory. Defaults to the active directory and assumes only csv files. For each file it calls find_fd().

search_csv <- function(dir = getwd()){
    
    files <- list.files(dir)  
    results <- data.frame()
    
    for (f in files){
        file <- read.csv(f)
        t <- find_fd(csv = file, source_name = f)
        
        results <- rbind(results, t)
    }
    return(results)
}

I created two csv files based on the following:

ex1 <- data.frame(
    PATIENT_ID = c('PAT_001', 'PAT_002', 'PAT_003', 'PAT_004', 'PAT_005'),
    EVENT_NAME = c('event_1', 'event_1', 'event_1', 'event_2', 'event_2'),
    bp = c(78, 65, 71, 82, 79),
    bpfile = c('/files_dir/bp.pdf', NA, '/files_dir/documnet.pdf', '', '/files_dir/bla.jpg'),
    result = c('NEG', 'POS', 'POS', '', '/files_dir/report.pdf')
)

ex2 <- data.frame(
    PATIENT_ID = c('PAT_011', 'PAT_012', 'PAT_013', 'PAT_014', 'PAT_015'),
    EVENT_NAME = c('event_3', 'event_1', 'event_3', 'event_2', 'event_3'),
    bp = c(78, 65, 71, 82, 79),
    bpfile = c(NA, '/files_dir/bp.pdf', '/files_dir/documnet.pdf', '', '/files_dir/bla.jpg'),
    other1 = c('NEG', 'POS', 'POS', '/files_dir/report.pdf','' )
)

> myresults <- search_csv()
> myresults
  PATIENT_ID EVENT_NAME var_name               file_name  source
1    PAT_001    event_1   bpfile       /files_dir/bp.pdf ex1.csv
2    PAT_003    event_1   bpfile /files_dir/documnet.pdf ex1.csv
3    PAT_005    event_2   bpfile      /files_dir/bla.jpg ex1.csv
4    PAT_005    event_2   result   /files_dir/report.pdf ex1.csv
5    PAT_012    event_1   bpfile       /files_dir/bp.pdf ex2.csv
6    PAT_013    event_3   bpfile /files_dir/documnet.pdf ex2.csv
7    PAT_015    event_3   bpfile      /files_dir/bla.jpg ex2.csv
8    PAT_014    event_2   result   /files_dir/report.pdf ex2.csv