I have an R function to extract information from one document. How do I loop that for all the docume

Time:12-09

I have a folder of txt files, and I want to extract specific pieces of text from them and arrange them in separate columns of a new data frame. I wrote the code for one file, but I can't work out how to turn it into a loop that runs across all the documents in my folder.

This is my code for the one txt file:

clean_text <- as.data.frame(strsplit(text$text, '\\*'), col.names = "text") %>%
  mutate(text = str_replace_all(text, "\n", " "),
         text = str_replace_all(text, "- ", ""),
         text = str_replace_all(text, "^\\s", "")) %>%
  
  filter(!text == " ") %>% 
  
  mutate(paragraphs = ifelse(grepl("^[[:digit:]]", text) == T, text, NA)) %>% 
  
  rename(category = text) %>% 
  mutate(category = ifelse(grepl("^[[:digit:]]", category) == T, NA, category)) %>% 
  fill(category) %>% 
  filter(!is.na(paragraphs)) %>% 
  
  mutate(paragraphs = strsplit(paragraphs, '^[[:digit:]]{1,3}\\.|\\t\\s[[:digit:]]{1,3}\\.')) %>% 
  unnest(paragraphs) %>% 
  mutate(paragraphs = strsplit(paragraphs, 'Download as PDF')) %>%
  unnest(paragraphs) %>% 
  mutate(paragraphs = str_replace_all(paragraphs, "\t", "")) %>% 
  mutate(paragraphs = ifelse(grepl("javascript", paragraphs), "", paragraphs)) %>%
  mutate(paragraphs = str_replace_all(paragraphs, "^\\s ", "")) %>%
  filter(!paragraphs == "") 

How do I make this into a loop? I realise there are similar questions, but none of the solutions have worked for me. Thanks in advance for the help!

CodePudding user response:

This doesn't use an explicit loop; lapply with a function does the same job as a loop:

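library(readr)  # provides read_file()

my_path <- "C:/Users/SAID ABIDI/Desktop/test/"
my_a <- list.files(path = my_path)

# Read one file, identified by its index in my_a, into a single string
my_function <- function(x) {
  read_file(paste(my_path, my_a[x], sep = ""))
}
# seq_along() also behaves correctly when the folder is empty,
# whereas 1:length(my_a) would then iterate over c(1, 0)
my_var <- lapply(seq_along(my_a), my_function)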
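To make the "same behavior as a loop" point concrete, here is a minimal sketch in base R that reads the same files once with lapply and once with a for loop and compares the results. The folder `demo_dir`, the file names, and the helper `read_one` are made up for illustration; `full.names = TRUE` is used so the paths do not have to be pasted together by hand.

```r
# Hypothetical demo folder with two small text files (not from the question)
demo_dir <- file.path(tempdir(), "lapply_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("only line", file.path(demo_dir, "b.txt"))

# full.names = TRUE returns paths that can be passed straight to a reader
paths <- list.files(demo_dir, full.names = TRUE)

# Hypothetical helper: read one file into a single string (base R only)
read_one <- function(path) paste(readLines(path), collapse = "\n")

# Version 1: lapply calls read_one once per path and collects a list
texts <- lapply(paths, read_one)

# Version 2: the same work written as an explicit for loop
texts_loop <- vector("list", length(paths))
for (i in seq_along(paths)) {
  texts_loop[[i]] <- read_one(paths[i])
}

# Both approaches produce the same list
same_result <- identical(texts, texts_loop)
```

Either form works; lapply is usually shorter because it creates and fills the result list for you.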

Does this help?

CodePudding user response:

Put your code in a function:

extract_info = function(file) {
  ## Add the code you need to read the text from the file
  ## Something like
  ## text <- readLines(file)
  ## or whatever you are using to read in the file
  clean_text <- as.data.frame(strsplit(text$text, '\\*'), col.names = "text") %>%
    mutate(text = str_replace_all(text, "\n", " "),
           text = str_replace_all(text, "- ", ""),
           text = str_replace_all(text, "^\\s", "")) %>%
    
    filter(!text == " ") %>% 
    
    mutate(paragraphs = ifelse(grepl("^[[:digit:]]", text) == T, text, NA)) %>% 
    
    rename(category = text) %>% 
    mutate(category = ifelse(grepl("^[[:digit:]]", category) == T, NA, category)) %>% 
    fill(category) %>% 
    filter(!is.na(paragraphs)) %>% 
    
    mutate(paragraphs = strsplit(paragraphs, '^[[:digit:]]{1,3}\\.|\\t\\s[[:digit:]]{1,3}\\.')) %>% 
    unnest(paragraphs) %>% 
    mutate(paragraphs = strsplit(paragraphs, 'Download as PDF')) %>%
    unnest(paragraphs) %>% 
    mutate(paragraphs = str_replace_all(paragraphs, "\t", "")) %>% 
    mutate(paragraphs = ifelse(grepl("javascript", paragraphs), "", paragraphs)) %>%
    mutate(paragraphs = str_replace_all(paragraphs, "^\\s ", "")) %>%
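    filter(!paragraphs == "")
  ## Return the cleaned data frame visibly; ending on the assignment alone
  ## returns it invisibly, so nothing would print when you test the function
  clean_text
}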

Test your function to make sure it works on one file:

extract_info("your_file_name.txt")
## does the result work and look right? 
## work on your function until it does

Get a list of all the files you want to run it on:

my_files = list.files()
## by default this will give you all the files in your working directory
## use the `pattern` argument if you only want files that follow
## a certain naming convention
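As a small illustration of the `pattern` argument, the sketch below keeps only the .txt files in a folder. The folder `demo_dir` and the file names are invented for the example; `pattern` takes a regular expression, so `"\\.txt$"` matches names that end in ".txt".

```r
# Hypothetical folder containing a mix of file types
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("doc1.txt", "doc2.txt", "notes.csv"))))

# Keep only names ending in ".txt"; full.names = TRUE returns usable paths
my_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Show just the file names that were matched
matched <- basename(my_files)
matched
```

With `full.names = TRUE` the function can be pointed at any folder without changing the working directory first.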

Apply your function to those files:

results = lapply(my_files, extract_info)
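`results` is a list with one element per file. If the goal is a single data frame, the list can be stacked afterwards. Below is a minimal base R sketch of that step; `toy_extract` is a hypothetical stand-in for `extract_info` (it only fakes a one-row result), and the file names are made up.

```r
# Hypothetical stand-in for extract_info(): one small data frame per file
toy_extract <- function(file) {
  data.frame(category = "A",
             paragraphs = paste("text from", file),
             stringsAsFactors = FALSE)
}

my_files <- c("doc1.txt", "doc2.txt")  # made-up file names
results <- lapply(my_files, toy_extract)

# Add the source file as a column so rows stay traceable after stacking
labelled <- Map(
  function(df, f) data.frame(file = f, df, stringsAsFactors = FALSE),
  results, my_files
)

# Stack the per-file data frames into one
all_docs <- do.call(rbind, labelled)
rownames(all_docs) <- NULL
```

If dplyr is already loaded, `bind_rows()` on a named list with `.id = "file"` gives a similar result in one call.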