Extract all text before the last occurrence of a specific word

Time:05-23

I wanted to extract all the text before the last occurrence of a specific word:

Please note again that and-approved forms are not to be count toward the page limit. To the extent possible, please limit the appendix file to one 50-page PDF submission.

The Project Description must be clear, concise, and complete. is particularly interested in Project Descriptions that convey strategies for achieving intended performance. Project Descriptions are evaluated on the basis of substance and measurable outcomes, not length. Cross-referencing should be used rather than repetition. Supporting documents designated as required must be included in the Appendix of the document

Clearly identify the physical, economic, social, financial, institutional, and/or other problem(s) requiring a solution. The need for assistance, including the nature and scope of the problem, must be demonstrated. Supporting documentation, such as letters of support and testimonials from concerned parties, may be included in the Appendix. Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available.

Appendix

Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available.

I want all the text before the last Appendix (excluding Appendix itself).

I am currently using this:

sub("(.*)Appendix", "", text)

but it only gets the text in the paragraph with the first Appendix. How do I adjust the regex?

Expected output:

Please note again that and-approved forms are not to be count toward the page limit. To the extent possible, please limit the appendix file to one 50-page PDF submission.

The Project Description must be clear, concise, and complete. is particularly interested in Project Descriptions that convey strategies for achieving intended performance. Project Descriptions are evaluated on the basis of substance and measurable outcomes, not length. Cross-referencing should be used rather than repetition. Supporting documents designated as required must be included in the Appendix of the document

Clearly identify the physical, economic, social, financial, institutional, and/or other problem(s) requiring a solution. The need for assistance, including the nature and scope of the problem, must be demonstrated. Supporting documentation, such as letters of support and testimonials from concerned parties, may be included in the Appendix. Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available.

CodePudding user response:

We can locate every occurrence of 'Appendix' with str_locate_all, take the last start index (dplyr::last), and extract the substring (str_sub) from position 1 up to the character just before that index. Wrapping the result in trimws removes any leading/trailing whitespace.

library(stringr)
out <- trimws( str_sub(text, 1, dplyr::last(str_locate_all(text, 
    "Appendix")[[1]][,1])-1))

-output

> cat(out, "\n")
Please note again that and-approved forms are not to be count toward the page limit. To the extent possible, please limit the appendix file to one 50-page PDF submission.

    The Project Description must be clear, concise, and complete. is particularly interested in Project Descriptions that convey strategies for achieving intended performance. Project Descriptions are evaluated on the basis of substance and measurable outcomes, not length. Cross-referencing should be used rather than repetition. Supporting documents designated as required must be included in the Appendix of the document

    Clearly identify the physical, economic, social, financial, institutional, and/or other problem(s) requiring a solution. The need for assistance, including the nature and scope of the problem, must be demonstrated. Supporting documentation, such as letters of support and testimonials from concerned parties, may be included in the Appendix. Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available. 

data

text <- "\n\n    Please note again that and-approved forms are not to be count toward the page limit. To the extent possible, please limit the appendix file to one 50-page PDF submission.\n\n    The Project Description must be clear, concise, and complete. is particularly interested in Project Descriptions that convey strategies for achieving intended performance. Project Descriptions are evaluated on the basis of substance and measurable outcomes, not length. Cross-referencing should be used rather than repetition. Supporting documents designated as required must be included in the Appendix of the document\n\n    Clearly identify the physical, economic, social, financial, institutional, and/or other problem(s) requiring a solution. The need for assistance, including the nature and scope of the problem, must be demonstrated. Supporting documentation, such as letters of support and testimonials from concerned parties, may be included in the Appendix. Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available.\n\n    Appendix\n\n    Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available.\n"
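The same positional idea can be sketched in base R, without stringr or dplyr. The helper name `before_last` and the toy string `demo` are hypothetical, introduced only for illustration. Returning the input unchanged when the word is absent is an assumption, not something the question specifies.

```r
# Hypothetical helper: text before the last literal occurrence of `word`.
before_last <- function(x, word) {
  # Start positions of every literal match; -1 if there is none.
  hits <- gregexpr(word, x, fixed = TRUE)[[1]]
  # Assumption: with no match, hand the input back unchanged.
  if (hits[1] == -1) return(x)
  # Keep everything up to the character just before the last match,
  # then trim surrounding whitespace (including newlines).
  trimws(substr(x, 1, hits[length(hits)] - 1))
}

demo <- "intro Appendix middle\n\nAppendix\n\nend"
before_last(demo, "Appendix")
```

Because `fixed = TRUE` treats the word literally, regex metacharacters in the search word need no escaping.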

CodePudding user response:

You can use the following code to select everything before the last occurrence of Appendix. The greedy (.*) backtracks only as far as the last Appendix, the trailing .* consumes the rest, and (?s) lets the dot match newlines under perl = TRUE:

trimws(sub('(?s)(.*)Appendix.*', '\\1', text, perl = TRUE))

Output:

[1] "Please note again that and-approved forms are not to be count toward the page limit. To the extent possible, please limit the appendix file to one 50-page PDF submission.\n\nThe Project Description must be clear, concise, and complete. is particularly interested in Project Descriptions that convey strategies for achieving intended performance. Project Descriptions are evaluated on the basis of substance and measurable outcomes, not length. Cross-referencing should be used rather than repetition. Supporting documents designated as required must be included in the Appendix of the document\n\nClearly identify the physical, economic, social, financial, institutional, and/or other problem(s) requiring a solution. The need for assistance, including the nature and scope of the problem, must be demonstrated. Supporting documentation, such as letters of support and testimonials from concerned parties, may be included in the Appendix. Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available."
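As a point of comparison, here is a small sketch on a toy string (the variable names are illustrative only) of how a greedy versus a lazy capture group decides which occurrence of the word ends the match:

```r
x <- "a Appendix b Appendix c"

# Greedy (.*) grabs as much as possible, then backtracks to the LAST "Appendix".
greedy <- sub("(?s)(.*)Appendix.*", "\\1", x, perl = TRUE)

# Lazy (.*?) grabs as little as possible, so it stops at the FIRST "Appendix".
lazy <- sub("(?s)(.*?)Appendix.*", "\\1", x, perl = TRUE)

greedy  # "a Appendix b "
lazy    # "a "
```

The greedy form is the one that fits the question, since only the final occurrence should act as the cut-off.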

CodePudding user response:

You can use this regex, which relies on a positive look-ahead (?=...) containing a negative look-ahead (?!...):

library(stringr)
str_extract(text, "(?is)^.*?(?=appendix(?!.*appendix))")

How this works:

  • (?is): match regardless of case, and let . match newlines as well
  • ^.*? the shortest stretch of characters from the start of the string until...
  • (?=appendix(?!.*appendix)) ... the word appendix follows and no further appendix occurs anywhere after it

Data:

text <- "Please note again that and-approved forms are not to be count toward the page limit. To the extent possible, please limit the appendix file to one 50-page PDF submission.The Project Description must be clear, concise, and complete. is particularly interested in Project Descriptions that convey strategies for achieving intended performance. Project Descriptions are evaluated on the basis of substance and measurable outcomes, not length. Cross-referencing should be used rather than repetition. Supporting documents designated as required must be included in the Appendix of the document. Clearly identify the physical, economic, social, financial, institutional, and/or other problem(s) requiring a solution. The need for assistance, including the nature and scope of the problem, must be demonstrated. Supporting documentation, such as letters of support and testimonials from concerned parties, may be included in the Appendix. Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available. Appendix Any relevant data based on planning studies or needs assessments should be included or referred to in the endnotes or footnotes. Incorporate demographic data and participant/beneficiary information, as available."
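One detail worth checking when the input still contains line breaks: in Perl-style engines the dot does not match "\n" unless the (?s) flag is set. The sketch below uses base R with perl = TRUE and a made-up two-line string to show the difference. stringr offers the same switch through regex(..., dotall = TRUE) or an inline (?s).

```r
x <- "first line Appendix\nsecond line Appendix tail"

# Without (?s) the dot stops at the newline, so the leftmost match
# is confined to the first line.
m1 <- regmatches(x, regexpr(".*(?=Appendix)", x, perl = TRUE))

# With (?s) the dot also matches "\n", so the greedy .* reaches
# the last "Appendix" in the whole string.
m2 <- regmatches(x, regexpr("(?s).*(?=Appendix)", x, perl = TRUE))

m1  # "first line "
m2  # "first line Appendix\nsecond line "
```

This may be one reason a pattern appears to stop at the paragraph containing the first occurrence rather than the last one.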
Tags: r