Deleting everything after a certain string in R

Time:11-27

I have some data in an object called all_lines, which is a character vector in R (the result of reading a PDF file into R). My objective is to delete everything before one string and everything after another string.

The data is stored in the object all_lines and looks like this:

class(all_lines)
"character"


[1] LABORATORY               Research Cover Sheet
[2] Number       201111EZ         Title   Maximizing throughput
[3]                                       in Computers
[4] Start Date   01/15/2000 
....
[49] Introduction
[50] Some text here and there
[51] Look more text
....
[912] Citations
[913] Author_1 Paper or journal
[914] Author_2 Book chapter

I want to delete everything before the string 'Introduction' and everything after 'Citations', but nothing I have found does the trick. I have tried commands from the post "How to delete everything after a matching string in R" and from several online R tutorials. With each of the commands below, all I get is the string 'Introduction' itself deleted from all_lines, with everything else returned unchanged.

str_remove(all_lines, "^.*(?=(Introduction))")
sub(".*Introduction", "", all_lines)
gsub(".*Introduction", "", all_lines)

I have also tried to delete everything after the string 'Citations' using the same commands, such as:

sub("Citations.*", "", all_lines)

Am I missing anything? Any help would really be appreciated!

CodePudding user response:

Assuming you can accept a single string as output, you could collapse the input into a single string and then use gsub():

all_lines <- paste(all_lines, collapse = " ")
# perl = TRUE is needed for the lookahead (?=...) and lookbehind (?<=...)
output <- gsub("^.*?(?=\\bIntroduction\\b)|(?<=\\bCitations\\b).*$", "", all_lines, perl = TRUE)
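
As a quick check of the collapse-then-gsub() idea, here is a minimal sketch on a made-up five-line vector. The name toy_lines and its contents are hypothetical stand-ins, not the asker's data.

```r
# Hypothetical stand-in for all_lines; the real vector comes from the PDF.
toy_lines <- c("LABORATORY Research Cover Sheet",
               "Introduction",
               "Some text here and there",
               "Citations",
               "Author_1 Paper or journal")

# Join every line into one string so a single regex can see the whole document.
collapsed <- paste(toy_lines, collapse = " ")

# Remove everything before "Introduction" and everything after "Citations".
# Lookarounds are only available in the PCRE engine, hence perl = TRUE.
trimmed <- gsub("^.*?(?=\\bIntroduction\\b)|(?<=\\bCitations\\b).*$", "",
                collapsed, perl = TRUE)
print(trimmed)
```

The trade-off is that the line boundaries are lost: the result is one long string rather than one element per line.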

CodePudding user response:

It looks like your variable is a vector of character strings, with one element per line of the document.
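
That would also explain why the sub() calls in the question seemed to delete only the word itself: sub() and gsub() are vectorised, so the pattern is applied to each element separately and can never reach across lines. A small sketch, using a made-up vector (toy_lines is hypothetical):

```r
# Hypothetical stand-in for all_lines.
toy_lines <- c("Cover Sheet", "Introduction", "Some text", "Citations", "Author_1")

# sub() is applied to each element independently, so only the element that
# contains the match changes; the result has the same length as the input.
result <- sub(".*Introduction", "", toy_lines)
print(result)
print(length(result) == length(toy_lines))
```

Only the second element becomes an empty string; every other line comes back untouched.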

We can use the grep() function to locate the lines containing the desired text. I am assuming that only one line contains "Introduction" and only one line contains "Citations".

#line numbers containing the start and end
Intro <- grep("Introduction", all_lines)
Citation <- grep("Citations", all_lines)

#extract out the desired portion.
abridged <- all_lines[Intro:Citation]

You may need to add 1 to Intro or subtract 1 from Citation if you would like to actually remove the "Introduction" or "Citations" line.
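
A minimal sketch of that adjustment, on a hypothetical toy_lines vector. It also takes the first "Introduction" and the last "Citations" in case either word appears on more than one line; that choice is an assumption about what is wanted, not part of the answer above. fixed = TRUE makes grep() treat the pattern as a literal string.

```r
# Hypothetical stand-in for all_lines.
toy_lines <- c("Cover Sheet", "Introduction", "Some text here",
               "Look more text", "Citations", "Author_1 Paper or journal")

# Indices of the lines containing each marker (literal match, no regex).
intro_idx <- grep("Introduction", toy_lines, fixed = TRUE)
cite_idx  <- grep("Citations", toy_lines, fixed = TRUE)

# Fail early if a marker is missing.
stopifnot(length(intro_idx) > 0, length(cite_idx) > 0)

first_intro <- min(intro_idx)
last_cite   <- max(cite_idx)

# The start marker must come before the end marker.
stopifnot(first_intro < last_cite)

# Keep both marker lines.
with_markers <- toy_lines[first_intro:last_cite]

# Drop both marker lines: start one line later, stop one line earlier.
# seq_len() yields an empty index if nothing lies between the markers.
body_only <- toy_lines[first_intro + seq_len(last_cite - first_intro - 1)]
```

Unlike the collapsed-string approach, this keeps the result as a character vector with one element per line.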
