I need to split PDF files into their chapters. In each PDF, I added the word "Hirfar" at the beginning of every chapter as a marker to search for and split on. Consider the following example:
t <- c(" Hirfar Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”.
Hirfar In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”.
Hirfar “At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said.
Hirfar He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”
Hirfar Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health.")
Here I used this code to break it into words:
library(stringr)
wrds <- str_split(t, pattern = boundary(type = "word"))
Now I want to look for the word "Hirfar" and separate this text into 5 different texts, each running from the first word after "Hirfar" up to the last word before the next "Hirfar".
CodePudding user response:
We may use a regex lookahead to split at the whitespace that precedes each "Hirfar", so the marker itself stays in the output:
strsplit(t, "\\s+(?=Hirfar)", perl = TRUE)[[1]][-1]
-output
[1] "Hirfar Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”."
[2] "Hirfar In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”."
[3] "Hirfar “At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said."
[4] "Hirfar He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”"
[5] "Hirfar Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health."
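To see why the `[-1]` is there, here is a minimal sketch on a short stand-in string (`x` and its contents are made up for illustration, not the original text). Because the text starts with whitespace followed by the marker, the first piece returned by the split is an empty string, and `[-1]` drops it.

```r
# Hypothetical stand-in text with the same shape as the example above
x <- " Hirfar first chapter.\n\nHirfar second chapter.\n\nHirfar third chapter."

# Split at whitespace that is immediately followed by the marker;
# the lookahead leaves "Hirfar" at the start of each piece
pieces <- strsplit(x, "\\s+(?=Hirfar)", perl = TRUE)[[1]]

# Whatever precedes the first marker comes back as the first element
# (an empty string here), so it is dropped
chapters <- pieces[-1]
print(chapters)
```

Note that this relies on the text beginning with whitespace before the first "Hirfar". If a file starts directly with "Hirfar", the first piece would already be the first chapter, and `[-1]` would discard it, so it may be worth checking `pieces[1]` before dropping it.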
If the output shouldn't include "Hirfar", split on the marker itself:
strsplit(t, "Hirfar\\s+")[[1]][-1]
[1] "Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”.\n\n"
[2] "In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”.\n\n "
[3] "“At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said.\n\n"
[4] "He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”\n\n"
[5] "Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health."
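The pieces above still carry the trailing line breaks and spaces that sat between chapters. As a small sketch of one way to clean that up (again on a made-up stand-in string `x`, not the original text), base R's `trimws()` can be applied to the result:

```r
# Hypothetical stand-in text; note the stray space before the third marker
x <- " Hirfar first chapter.\n\nHirfar second chapter.\n\n Hirfar third chapter."

# Split on the marker plus the whitespace after it, dropping the leading piece
chapters <- strsplit(x, "Hirfar\\s+")[[1]][-1]

# Remove leading and trailing whitespace (spaces, tabs, newlines) from each piece
chapters <- trimws(chapters)
print(chapters)
```

This keeps whitespace inside each chapter untouched and only strips it at the ends.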