How to make a tab delimited file based on many txt files?

Time:12-04

The instructions are:

Create a tab delimited file with all the abstracts. Each "field" of the record should have its own column: Presenter, Title, …, Abstract. The Keywords should be split into individual keywords (separate columns), where you should take into account that there are at most 6 keywords.

One text file looks like this: text file

Here is what I wrote so far but I'm not sure if it is correct.

files_to_read <- list.files(path="Abstracts")
                            
creating_file <- function(abstracts) {
  require(stringr)
  lines <- readLines(con = abstracts)
 
} 

#creating_file("Abstracts")

CodePudding user response:

This is a more difficult task than you think. First, create a new project in RStudio. Then create a files directory in the project directory and collect all your files there.

After that, you can run the script below.

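library(tidyverse)
library(fs)
library(data.table)

# Read one abstract file and return it as a one-row tibble.
readFile = function(fileName){
  # file = reads from the path; sep = NULL puts each line into one column (V1)
  lines = fread(file = fileName, sep = NULL, header = FALSE)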
  
  tibble(txt = lines$V1[1:5]) %>% #1
    separate(txt, c("name", "value"), sep = ": ") %>% #2
    bind_rows(
      tibble(
        name = paste0("Keywords", 1:6),
        value = lines$V1[6] %>% 
          str_match("(^.*): (.*)") %>% .[,3] %>% 
          str_split(", ", 6) %>% .[[1]] %>% .[1:6]) #3
    ) %>% #4
    bind_rows(
      tibble(
        name = "Abstract",
        value = paste(lines$V1[8:nrow(lines)], collapse = " ")) #5
    ) %>% #6
    pivot_wider(names_from = name, values_from = value) #7
}

files = dir_ls("files")
df = tibble()
for(file in files){
  df = df %>% bind_rows(readFile(file))
}
df

df %>% write_csv("Result.csv")

Since you are a beginner, let me explain step by step how it works.

There is a file2.txt in my files directory. Here is its content.

Presenter: Ronald Beginer 2
Title: Exploiting
Format: Lecture
Session: 2_mode
Date and time: 03-14-2009 8:30am
Keywords: Method, Two-Mode Data, QCA, Method, Two-Mode Data, QCA, Method, Two-Mode Data, QCA

An innovative ...bla bla bla and other.
An innovative ...bla bla bla and other.

Now let me show you how my readFile function works when used to read this file. First, I read the entire file into the variable lines:

lines = fread(file = fileName, sep = NULL, header = FALSE)

Then I turn it into a tibble. The next steps follow (see the numbered comments in the code).

Step 1 output

# A tibble: 5 x 1
  txt                             
  <chr>                           
1 Presenter: Ronald Beginer 2     
2 Title: Exploiting               
3 Format: Lecture                 
4 Session: 2_mode                 
5 Date and time: 03-14-2009 8:30am

Step 2 output

# A tibble: 5 x 2
  name          value            
  <chr>         <chr>            
1 Presenter     Ronald Beginer 2 
2 Title         Exploiting       
3 Format        Lecture          
4 Session       2_mode           
5 Date and time 03-14-2009 8:30am
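One thing to watch in step 2: a value that itself contains ": " (a title with a subtitle, for example) has more than two pieces, and separate() with its default settings warns and drops the extra ones (extra = "merge" avoids that). Below is a minimal base R sketch of the same split that only cuts at the first ": ". The helper name split_field and the sample lines are made up for illustration; this is not the code from the script above.

```r
# Hypothetical sample lines, copied from the file2.txt example.
txt <- c("Presenter: Ronald Beginer 2",
         "Title: Exploiting",
         "Date and time: 03-14-2009 8:30am")

# Split each line at the FIRST ": " only, so a value that itself
# contains ": " stays in one piece.
split_field <- function(x) {
  pos <- regexpr(": ", x, fixed = TRUE)   # position of the first ": "
  data.frame(name  = substr(x, 1, pos - 1),
             value = substr(x, pos + 2, nchar(x)),
             stringsAsFactors = FALSE)
}

fields <- split_field(txt)
print(fields)
```

Calling split_field("Title: A: B") should keep "A: B" together as the value.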

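Now watch out: for step 4 we have to prepare a separate tibble with exactly six keywords. This tibble is made in step 3. Because str_split is limited to 6 pieces, any keywords beyond the fifth stay together in the sixth piece, and .[1:6] pads a shorter list with NA.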

Step 3 output

# A tibble: 6 x 2
  name      value                          
  <chr>     <chr>                          
1 Keywords1 Method                         
2 Keywords2 Two-Mode Data                  
3 Keywords3 QCA                            
4 Keywords4 Method                         
5 Keywords5 Two-Mode Data                  
6 Keywords6 QCA, Method, Two-Mode Data, QCA
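If you would rather not get a merged value like the one in Keywords6, here is a small base R sketch that always returns exactly six slots. Note that it behaves differently from step 3: it drops anything after the sixth keyword instead of gluing it together. The function name split_keywords is hypothetical, and I am assuming keywords are separated by ", " as in the sample file.

```r
# Hypothetical helper: always return exactly `n` keyword slots.
split_keywords <- function(line, n = 6) {
  value <- sub("^[^:]*: ", "", line)               # drop the "Keywords: " label
  kw <- strsplit(value, ", ", fixed = TRUE)[[1]]   # one element per keyword
  kw <- kw[seq_len(min(length(kw), n))]            # keep at most n keywords
  length(kw) <- n                                  # pad short lists with NA
  kw
}

print(split_keywords("Keywords: Method, Two-Mode Data, QCA"))
```

Since the task says there are at most 6 keywords, the truncation should never trigger on valid input; it only keeps the column count fixed if a file breaks that rule.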

Step 4 output

# A tibble: 11 x 2
   name          value                          
   <chr>         <chr>                          
 1 Presenter     Ronald Beginer 2               
 2 Title         Exploiting                     
 3 Format        Lecture                        
 4 Session       2_mode                         
 5 Date and time 03-14-2009 8:30am              
 6 Keywords1     Method                         
 7 Keywords2     Two-Mode Data                  
 8 Keywords3     QCA                            
 9 Keywords4     Method                         
10 Keywords5     Two-Mode Data                  
11 Keywords6     QCA, Method, Two-Mode Data, QCA

Similarly, for step 6, we need to create a separate tibble that we attach to the rest. We create this tibble in step 5.

Step 5 output

# A tibble: 1 x 2
  name     value                                                                          
  <chr>    <chr>                                                                          
1 Abstract An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
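Step 5 relies on the abstract always starting at line 8. If some files have a different number of header lines, one possible alternative is to look for the first blank line instead. This is only a sketch in base R under the assumption that a blank line always separates the header from the abstract; the sample vector stands in for the lines of one file.

```r
# Hypothetical file content: header lines, one blank line, then the abstract.
lines <- c("Presenter: Ronald Beginer 2",
           "Keywords: Method, Two-Mode Data, QCA",
           "",
           "An innovative ...bla bla bla and other.",
           "An innovative ...bla bla bla and other.")

blank <- which(trimws(lines) == "")[1]   # first empty line ends the header
if (is.na(blank)) stop("no blank line found between header and abstract")

body <- lines[-seq_len(blank)]           # everything after the blank line
abstract <- paste(body[trimws(body) != ""], collapse = " ")
print(abstract)
```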

Step 6 output

# A tibble: 12 x 2
   name          value                                                                          
   <chr>         <chr>                                                                          
 1 Presenter     Ronald Beginer 2                                                               
 2 Title         Exploiting                                                                     
 3 Format        Lecture                                                                        
 4 Session       2_mode                                                                         
 5 Date and time 03-14-2009 8:30am                                                              
 6 Keywords1     Method                                                                         
 7 Keywords2     Two-Mode Data                                                                  
 8 Keywords3     QCA                                                                            
 9 Keywords4     Method                                                                         
10 Keywords5     Two-Mode Data                                                                  
11 Keywords6     QCA, Method, Two-Mode Data, QCA                                                
12 Abstract      An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.

In the last step, we'll make it wide.

Step 7 output

# A tibble: 1 x 12
  Presenter        Title      Format  Session `Date and time`   Keywords1 Keywords2     Keywords3 Keywords4 Keywords5     Keywords6                       Abstract                       
  <chr>            <chr>      <chr>   <chr>   <chr>             <chr>     <chr>         <chr>     <chr>     <chr>         <chr>                           <chr>                          
1 Ronald Beginer 2 Exploiting Lecture 2_mode  03-14-2009 8:30am Method    Two-Mode Data QCA       Method    Two-Mode Data QCA, Method, Two-Mode Data, QCA An innovative ...bla bla bla a~
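To make the long-to-wide step less magical, here is roughly the same reshaping in base R on a shortened, made-up version of the long table. It is not what pivot_wider does internally, just an equivalent result for the one-record case. In current tidyr the call is usually written as pivot_wider(names_from = name, values_from = value).

```r
# Shortened long table (name/value pairs) for one record; values are sample data.
long <- data.frame(
  name  = c("Presenter", "Title", "Date and time", "Keywords1"),
  value = c("Ronald Beginer 2", "Exploiting", "03-14-2009 8:30am", "Method"),
  stringsAsFactors = FALSE
)

# Name the values, transpose to a 1-row matrix, and turn that into a data frame.
# check.names = FALSE keeps column names with spaces, such as "Date and time".
wide <- data.frame(t(setNames(long$value, long$name)),
                   check.names = FALSE, stringsAsFactors = FALSE)
print(wide)
```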

The rest is simple. Bind the tibbles obtained for each file together and save the result to one CSV file.

csv file

Presenter,Title,Format,Session,Date and time,Keywords1,Keywords2,Keywords3,Keywords4,Keywords5,Keywords6,Abstract
Ronald Beginer 1,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,NA,NA,NA,An innovative ...bla bla bla and other.
Ronald Beginer 2,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,Method,Two-Mode Data,"QCA, Method, Two-Mode Data, QCA",An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
Ronald Beginer 3,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,NA,NA,NA,An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
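Since the assignment asks for a tab delimited file rather than a CSV, the last step can be swapped for a tab-separated writer. With readr that would be write_tsv() in place of write_csv(). Below is a minimal base R sketch using write.table(); the data frame is a cut-down, made-up stand-in for df, and the output path in tempdir() is only an assumption for the example.

```r
# Cut-down stand-in for the combined table; values are sample data.
df <- data.frame(
  Presenter = c("Ronald Beginer 1", "Ronald Beginer 2"),
  Keywords1 = c("Method", "Method"),
  Keywords4 = c(NA, "Method"),
  Abstract  = c("An innovative ...", "An innovative ... An innovative ..."),
  stringsAsFactors = FALSE
)

out <- file.path(tempdir(), "Result.tsv")   # hypothetical output path

# sep = "\t" makes the file tab delimited; na = "" writes missing keywords
# as empty cells; quote = FALSE assumes no field contains a tab itself.
write.table(df, out, sep = "\t", quote = FALSE, na = "", row.names = FALSE)

# Read it back to check the round trip.
back <- read.delim(out, stringsAsFactors = FALSE, na.strings = "")
print(back)
```

If an abstract could contain a tab character, keep the default quoting instead of quote = FALSE.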

P.S. When writing questions on StackOverflow, never post your data as a picture!
