The instructions are:
Create a tab-delimited file with all the abstracts. Each field of the record should have its own column: Presenter, Title, …, Abstract. The Keywords should be split into individual keywords (separate columns), taking into account that there are at most 6 keywords.
One text file looks like this: [image: text file]
Here is what I wrote so far but I'm not sure if it is correct.
files_to_read <- list.files(path="Abstracts")
creating_file <- function(abstracts) {
  require(stringr)
  lines <- readLines(con = abstracts)
}
#creating_file("Abstracts")
CodePudding user response:
This is a more difficult task than you think. First, create a new project in RStudio. Then create a files directory in the project directory and collect all your files there.
After that, you can run the script below.
library(tidyverse)
library(fs)
library(data.table)

readFile = function(fileName){
  # Read the file line by line into a one-column data.table (column V1)
  lines = fread(file = fileName, sep = NULL, header = FALSE)
  tibble(txt = lines$V1[1:5]) %>% #1
    # extra = "merge" keeps any further ": " inside the value
    separate(txt, c("name", "value"), sep = ": ", extra = "merge") %>% #2
    bind_rows(
      tibble(
        name = paste0("Keywords", 1:6),
        value = lines$V1[6] %>%
          str_match("(^.*): (.*)") %>% .[,3] %>%
          str_split(", ", 6) %>% .[[1]] %>% .[1:6]) #3
    ) %>% #4
    bind_rows(
      tibble(
        name = "Abstract",
        value = paste(lines$V1[8:nrow(lines)], collapse = " ")) #5
    ) %>% #6
    pivot_wider(names_from = name, values_from = value) #7
}

# Read every file in the files directory and stack the one-row results
files = dir_ls("files")
df = tibble()
for(file in files){
  df = df %>% bind_rows(readFile(file))
}
df
df %>% write_csv("Result.csv")
Since you are a beginner, let me explain step by step how it works.
There is a file2.txt in my files directory. Here is its content.
Presenter: Ronald Beginer 2
Title: Exploiting
Format: Lecture
Session: 2_mode
Date and time: 03-14-2009 8:30am
Keywords: Method, Two-Mode Data, QCA, Method, Two-Mode Data, QCA, Method, Two-Mode Data, QCA
An innovative ...bla bla bla and other.
An innovative ...bla bla bla and other.
Now let me show you how my readFile function works when used to read this file.
First, I read the entire file into the variable lines
lines = fread(file = fileName, sep = NULL, header = FALSE)
And then I turn it into a tibble. Here are the next steps (see the numbered comments in the code).
Step 1 output
# A tibble: 5 x 1
txt
<chr>
1 Presenter: Ronald Beginer 2
2 Title: Exploiting
3 Format: Lecture
4 Session: 2_mode
5 Date and time: 03-14-2009 8:30am
Step 2 output
# A tibble: 5 x 2
name value
<chr> <chr>
1 Presenter Ronald Beginer 2
2 Title Exploiting
3 Format Lecture
4 Session 2_mode
5 Date and time 03-14-2009 8:30am
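Step 2 splits each header line on ": ". A title that itself contains ": " would then be cut short unless the extra pieces are merged back into the value. The sketch below illustrates this with tidyr's `extra = "merge"` option; the header lines in it are hypothetical and are not taken from the original files.

```r
library(tidyr)
library(tibble)

# Hypothetical header lines; the title itself contains ": "
header <- tibble(txt = c("Presenter: Ronald Beginer 2",
                         "Title: Networks: A Review"))

# extra = "merge" splits only at the first ": " and keeps the rest in value
fields <- separate(header, txt, c("name", "value"), sep = ": ", extra = "merge")
fields
```

Without `extra = "merge"`, separate() would warn and drop everything after the second ": ".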
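Now watch out: for step 4 we have to prepare a separate tibble with exactly six keywords. This tibble is made in step 3. Because str_split is called with a limit of 6, any keywords beyond the fifth stay together in Keywords6, and indexing with [1:6] fills the missing slots with NA when a file has fewer than six keywords.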
Step 3 output
# A tibble: 6 x 2
name value
<chr> <chr>
1 Keywords1 Method
2 Keywords2 Two-Mode Data
3 Keywords3 QCA
4 Keywords4 Method
5 Keywords5 Two-Mode Data
6 Keywords6 QCA, Method, Two-Mode Data, QCA
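The keyword handling of step 3 can also be tried in isolation. The following is a minimal sketch; `split_keywords` is a hypothetical helper name introduced only for illustration, and it assumes the line always starts with "Keywords: ".

```r
library(stringr)

# Hypothetical helper: split a "Keywords: ..." line into exactly n slots
split_keywords <- function(line, n = 6) {
  value <- str_remove(line, "^Keywords: ")      # drop the field label
  parts <- str_split(value, ", ", n = n)[[1]]   # at most n pieces; the last keeps any remainder
  parts[1:n]                                    # indexing past the end pads with NA
}

split_keywords("Keywords: Method, Two-Mode Data, QCA")
split_keywords("Keywords: A, B, C, D, E, F, G")
```

The first call yields three keywords followed by three NA values; the second keeps "F, G" together in the sixth slot, which mirrors the Keywords6 row shown above.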
Step 4 output
# A tibble: 11 x 2
name value
<chr> <chr>
1 Presenter Ronald Beginer 2
2 Title Exploiting
3 Format Lecture
4 Session 2_mode
5 Date and time 03-14-2009 8:30am
6 Keywords1 Method
7 Keywords2 Two-Mode Data
8 Keywords3 QCA
9 Keywords4 Method
10 Keywords5 Two-Mode Data
11 Keywords6 QCA, Method, Two-Mode Data, QCA
Similarly, for step 6, we need to create a separate tibble that we attach to the rest. We create this tibble in step 5 by pasting every line from line 8 onward into a single string.
Step 5 output
# A tibble: 1 x 2
name value
<chr> <chr>
1 Abstract An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
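Step 5 relies on the abstract always starting on line 8. If that position varies between files, one option is to look for the separator instead. The sketch below assumes, and this is only an assumption, that a blank line separates the header fields from the abstract; the sample lines are hypothetical.

```r
# Hypothetical file content, given as a character vector of lines
lines <- c(
  "Presenter: Ronald Beginer 2",
  "Keywords: Method, QCA",
  "",
  "An innovative ...bla bla bla and other.",
  "An innovative ...bla bla bla and other."
)

# Index of the first blank line (assumed to exist); the abstract is everything after it
blank <- which(trimws(lines) == "")[1]
abstract <- paste(lines[(blank + 1):length(lines)], collapse = " ")
abstract
```

If a file has no blank line, `blank` is NA and the indexing fails, so real input would need a check for that case.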
Step 6 output
# A tibble: 12 x 2
name value
<chr> <chr>
1 Presenter Ronald Beginer 2
2 Title Exploiting
3 Format Lecture
4 Session 2_mode
5 Date and time 03-14-2009 8:30am
6 Keywords1 Method
7 Keywords2 Two-Mode Data
8 Keywords3 QCA
9 Keywords4 Method
10 Keywords5 Two-Mode Data
11 Keywords6 QCA, Method, Two-Mode Data, QCA
12 Abstract An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
In the last step, we'll make it wide.
Step 7 output
# A tibble: 1 x 12
Presenter Title Format Session `Date and time` Keywords1 Keywords2 Keywords3 Keywords4 Keywords5 Keywords6 Abstract
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Ronald Beginer 2 Exploiting Lecture 2_mode 03-14-2009 8:30am Method Two-Mode Data QCA Method Two-Mode Data QCA, Method, Two-Mode Data, QCA An innovative ...bla bla bla a~
The rest is simple. Bind the tibbles obtained for each file together and save the result to one csv file.
csv file
Presenter,Title,Format,Session,Date and time,Keywords1,Keywords2,Keywords3,Keywords4,Keywords5,Keywords6,Abstract
Ronald Beginer 1,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,NA,NA,NA,An innovative ...bla bla bla and other.
Ronald Beginer 2,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,Method,Two-Mode Data,"QCA, Method, Two-Mode Data, QCA",An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
Ronald Beginer 3,Exploiting,Lecture,2_mode,03-14-2009 8:30am,Method,Two-Mode Data,QCA,NA,NA,NA,An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other. An innovative ...bla bla bla and other.
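The instructions in the question ask for a tab-delimited file rather than a csv. A minimal sketch of that variant follows, using readr's `write_tsv`. The one-row tibbles are hypothetical stand-ins for the results of readFile, and the output goes to a temporary file only for illustration.

```r
library(tibble)
library(dplyr)
library(readr)

# Hypothetical one-row tibbles standing in for the results of readFile()
rows <- list(
  tibble(Presenter = "Ronald Beginer 1", Title = "Exploiting", Keywords1 = "Method"),
  tibble(Presenter = "Ronald Beginer 2", Title = "Exploiting", Keywords1 = "QCA")
)

# bind_rows() accepts a list of data frames, so no explicit loop is needed
df <- bind_rows(rows)

# Write a tab-delimited file; a temporary path is used here for illustration
out <- tempfile(fileext = ".tsv")
write_tsv(df, out)
readLines(out)[1]
```

With the real files, the list could be built with something like purrr::map(files, readFile) and the result written to a path such as "Result.tsv".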
P.S. When writing questions on StackOverflow, never post your data as a picture!