Text processing and analysis in R

Time:01-05

I am starting to analyze, in RStudio, an interview I conducted. The interview consists, as usual, of the interviewer's questions and the subject's answers.

text<- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

I would like to remove all of the interviewer's questions so that I can analyze the subject's answers. I do not know how to proceed in R; in fact, I do not even know exactly what to Google.

I would appreciate all the help I can get. Thank you in advance.

CodePudding user response:

base R:

text<- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

this gives you

text
[1] "Interviewer: Hello, how are you?\nSubject: I am fine, thanks.\n\nInterviewer: What is your name?\nSubject: My name is Gerard."

where the \n are the newline characters that you can split on with strsplit():

strsplit(text, '\n')[[1]] # strsplit returns a list
[1] "Interviewer: Hello, how are you?" "Subject: I am fine, thanks."     
[3] ""                                 "Interviewer: What is your name?" 
[5] "Subject: My name is Gerard."
text2 <- strsplit(text, '\n')[[1]] # store the character vector, not the list

text2[c(2,5)]
[1] "Subject: I am fine, thanks." "Subject: My name is Gerard."
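Picking elements 2 and 5 by position only works for this exact transcript. For a longer interview, a minimal sketch (assuming every answer line starts with the label Subject:, as in the question) could select the lines by pattern instead of by index. The names lines and answers are only illustrative:

```r
text <- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

# Split into lines; fixed = TRUE treats "\n" as a literal newline, not a regex
lines <- strsplit(text, "\n", fixed = TRUE)[[1]]

# Keep only the lines that start with the "Subject:" label
answers <- grep("^Subject:", lines, value = TRUE)
answers
```

This also drops the empty separator lines, since they do not match the pattern.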

CodePudding user response:

If your data is stored in a character vector text, as indicated in the question, try this:

With as_tibble we transform the vector into a tibble (more or less equivalent to a data frame), then we separate the rows at \n, and finally we filter out the interviewer's lines and the empty rows:

library(dplyr)
library(tidyr)

text <- as_tibble(text) %>% 
  separate_rows(value, sep="\n") %>% 
  filter(!grepl("Interviewer", value) & value!="") %>% 
  pull(value)
text
[1] "Subject: I am fine, thanks." "Subject: My name is Gerard."
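One caveat with an unanchored pattern such as "Interviewer": it would also drop an answer that merely mentions that word. A small base R sketch of the difference, using made-up lines that are not part of the original interview (the same anchored pattern could presumably be used inside the filter() call):

```r
# Hypothetical lines: the first answer mentions the word "Interviewer"
lines <- c("Interviewer: Who asked you?",
           "Subject: The Interviewer before you did.",
           "",
           "Subject: Nobody else.")

# Unanchored match also removes the answer that merely contains the word
loose <- lines[!grepl("Interviewer", lines) & lines != ""]

# Anchored match removes only lines that begin with the speaker label
strict <- lines[!grepl("^Interviewer:", lines) & lines != ""]
```

Here loose keeps one line, while strict keeps both answers.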

CodePudding user response:

An approach using strsplit and sub/gsub.

text_new <- gsub("\n", "", sub(".*(Subject: )", "\\1", 
              unlist(strsplit(text, "Interviewer: "))))
text_new[nchar(text_new) > 0]
[1] "Subject: I am fine, thanks." "Subject: My name is Gerard."
  • First split the string using Interviewer:.
  • Since each non-empty piece still begins with the question text, remove everything up to Subject: with sub.
  • Remove existing newlines with gsub.
  • Finally select non-empty strings.
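The steps listed above can be sketched as separate statements, which makes the intermediate results easier to inspect. This is the same logic as the nested call, only unrolled; the intermediate names (pieces, from_subject, no_newlines) are illustrative:

```r
text <- "Interviewer: Hello, how are you?
Subject: I am fine, thanks.

Interviewer: What is your name?
Subject: My name is Gerard."

# Step 1: split on the interviewer label; the first piece is an empty string
pieces <- unlist(strsplit(text, "Interviewer: "))

# Step 2: drop everything before "Subject: " in each piece
from_subject <- sub(".*(Subject: )", "\\1", pieces)

# Step 3: remove the leftover newlines
no_newlines <- gsub("\n", "", from_subject)

# Step 4: keep only the non-empty strings
text_new <- no_newlines[nchar(no_newlines) > 0]
text_new
```

Printing pieces after step 1 shows why the later steps are needed: each non-empty piece still carries the question text and trailing newlines.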