As per the title, I am trying to clean a large compilation of short texts by removing sentences that start with certain words, but only when such a sentence is the last of more than one sentence in that text.
Suppose I want to cut out the last sentence if it begins with 'Jack is ...'
Here is an example with varied cases:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
And here is the regex I currently have: "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"
map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))
Producing these results:
[1] "[TRIM]"
[2] "and [TRIM]"
[3] "There are mirrors. And [TRIM]"
[4] "There are dogs. And [TRIM]"
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
## Basically my current regex is still too greedy.
## No trimming should happen for the first 4 examples.
## 5 - 7th examples are correct.
## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it.
# (2) Wrong. It is a sentence but does not start with 'Jack is'.
# (3) Wrong. Same situation as (2) -- the sentence starts with 'And' instead of 'Jack is'
# (4) Wrong. Same as (2) (3), but this time test with lowercase `jack`
# (5) Correct. Trim the second sentence as it is the last. Optional ',' removal is tested here.
# (6) Correct.
# (7) Correct. Sometimes texts do not begin with a letter.
Thanks for any help!
CodePudding user response:
gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]", test_strings, ignore.case = TRUE)
# [1] "Jack is the tallest person."
# [2] "and Jack is the one who said, let there be fries."
# [3] "There are mirrors. And Jack is there to be suave."
# [4] "There are dogs. And jack is there to pat them. Very cool."
# [5] "Jack is your lumberjack. [TRIM]"
# [6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
# [7] "'Jack is so cool!' Jack is cool. [TRIM]"
Break-down:
^(.*\\.)\\s* : since we need there to be at least one sentence before what we trim out, we need to find a preceding dot (\\.);
Jack,? is : from your regex;
[^.]*\\.?$ : zero or more non-dot characters followed by an optional dot and end-of-string. If you want to allow blank space after the last period, you can change this to [^.]*\\.?\\s*$ (it didn't seem necessary in your example).
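As a minimal sketch of the whitespace-tolerant variant mentioned above, the following applies the pattern with \\s* added before the end-of-string anchor. The `padded` inputs are hypothetical (they are not part of the question's data) and only illustrate trailing blanks after the final period.

```r
# Hypothetical inputs (not from the question) with whitespace after the last period.
padded <- c("Jack is your lumberjack. Jack is super awesome.  ",
            "Jack is the tallest person. ")

# Same pattern as in the answer, with \\s* inserted before the $ anchor.
trimmed <- sub("^(.*\\.)\\s*Jack,? is[^.]*\\.?\\s*$", "\\1 [TRIM]",
               padded, ignore.case = TRUE)
print(trimmed)
```

The second string is left untouched because it has no earlier sentence for the capture group to match.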
CodePudding user response:
You can match a dot (or match more chars using a character class such as [.!?]) and then match a last sentence that starts with Jack and ends with a dot (or again use the character class to match more chars):
\.\K\h*[Jj]ack,? is[^.\n]*\.$
The pattern matches:
\.\K : match a . and forget what is matched so far
\h*[Jj]ack,? is : match optional horizontal whitespace chars, then Jack or jack, an optional comma, and " is"
[^.\n]*\. : optionally match any chars except a . or a newline, then match a .
$ : end of string
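To illustrate what "forget what is matched so far" means in practice, here is a small sketch comparing a replacement with and without \K. The string `x` is a made-up example, not taken from the question.

```r
# Hypothetical example string (not from the question).
x <- "Keep this. Drop this."

# Without \\K the leading dot is part of the match, so it is removed too.
without_k <- sub("\\. Drop this\\.$", "", x, perl = TRUE)

# With \\K the reported match starts after the dot, so the dot survives.
with_k <- sub("\\.\\K Drop this\\.$", "", x, perl = TRUE)

print(c(without_k, with_k))
```

Note that \K requires perl = TRUE; it is not available in R's default regex engine.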
Example code:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]", test_strings, perl=TRUE)
Output
[1] "Jack is the tallest person."
[2] "and Jack is the one who said, let there be fries."
[3] "There are mirrors. And Jack is there to be suave."
[4] "There are dogs. And jack is there to pat them. Very cool."
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
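If sentences may also end in ! or ?, the character class [.!?] suggested at the top of this answer can stand in for the literal dot. This is only a sketch under that assumption; the `other_strings` vector is hypothetical and not part of the question's data.

```r
# Hypothetical inputs (not from the question) using ! and ? as sentence terminators.
other_strings <- c("Is it cold? Jack is freezing!",
                   "Jack is here!")

# [.!?] replaces the literal dot both before \\K and at the end of the last sentence.
out <- sub("[.!?]\\K\\h*[Jj]ack,? is[^.!?\\n]*[.!?]$", " [TRIM]",
           other_strings, perl = TRUE)
print(out)
```

As with the original pattern, a text consisting of a single sentence is left unchanged because no terminator precedes "Jack is".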