I want to find the paragraph around a word using a regular expression. The start and end of each paragraph are identified by the delimiter '@@'. I am using the Alteryx RegEx tool with the Tokenize method, which is Perl 5 compatible.
e.g. Text:
@@Consumers can also monitor their accounts regularly by allowing them to keep their accounts safe. Around-the-clock access to banking information provides early detection of fraudulent activity, thereby acting as a guardrail against financial damage or loss.@@ Online Bill Payment one of the great advantages of online banking is online bill pay. Rather than having to write checks or fill out forms to pay bills, once you set up your accounts at your online bank, all it takes is a simple click or even less, as you can usually automate your bill payments. With online bill pay, it’s easy to manage your accounts from one central source and to track payments into and out of your account.@@ In spite of their many advantages, there are some drawbacks to using online banks as well. Here are some of the downsides/drawback of working with an online bank @@
Case:
If I specify the phrase "one central source", it should extract the paragraph that contains it, starting and ending at the delimiter '@@'.
Expected output:
Online Bill Payment one of the great advantages of online banking is online bill pay. Rather than having to write checks or fill out forms to pay bills, once you set up your accounts at your online bank, all it takes is a simple click or even less, as you can usually automate your bill payments. With online bill pay, it’s easy to manage your accounts from one central source and to track payments into and out of your account.
\bone central source\b(.*?)@@
https://regex101.com/r/IbZEkd/1
CodePudding user response:
If the tool is Perl 5 compatible, you can use:
(?s)@@\s*\K(?:.(?!@@))*\bone central source\b.*?(?=@@)
Explanation
(?s)
Inline modifier, have the dot match a newline
@@\s*\K
Match @@, match optional whitespace chars, and then clear the match buffer
(?:.(?!@@))*
Match any char that is not directly followed by @@
\bone central source\b
Match literally between word boundaries to prevent partial word matches
.*?
Match any char, as few as possible
(?=@@)
Positive lookahead, assert @@ to the right
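For anyone who wants to try the idea outside Alteryx, here is a minimal Python sketch. Python's built-in re module does not support \K, so this version swaps \K for a capturing group and keeps the rest of the pattern; re.S stands in for (?s). The sample text is a shortened, made-up stand-in for the text in the question.

```python
import re

# Same idea as the pattern above, but \K is replaced by a capturing group
# because Python's built-in re module has no \K. re.S plays the role of (?s).
PATTERN = re.compile(
    r"@@\s*((?:.(?!@@))*\bone central source\b.*?)(?=@@)",
    re.S,
)

# Shortened, hypothetical sample in the same shape as the question's text.
text = (
    "@@Consumers can monitor their accounts regularly.@@ "
    "With online bill pay, it's easy to manage your accounts from "
    "one central source and to track payments.@@ "
    "There are some drawbacks as well. @@"
)

match = PATTERN.search(text)
paragraph = match.group(1) if match else None
print(paragraph)
```

The first @@ is tried and rejected, because the tempered part (?:.(?!@@))* cannot cross the next @@ to reach the phrase, so the match can only start at the delimiter that opens the right paragraph.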
CodePudding user response:
Something like this should work: /@@\s*((?:.(?!@@))*?\bone central source\b.*?)\s*@@/gs
Testing it here: https://regex101.com/r/s9e1ej/1
The idea is to search for @@, possibly followed by spaces, and then any char which isn't followed by @@. This can be done with a negative lookahead: .(?!@@) means any char not followed by @@. The pattern (?:.(?!@@))*? is this same expression inside a non-capturing group, which can be repeated but with a lazy (ungreedy) quantifier. This is to avoid eating your sentence.
As you can see in the example, the text can contain the @ symbol, as I showed by adding an e-mail address to the text.
Then, as you did, search for the sentence you are looking for with the word boundary \b. I removed the case-insensitive flag, so you might need to re-enable it if your sentence can be written in another case.
If you don't want to get the delimiting separator, you can put the middle part in a capturing group. And if you can't use a group with your tool, then look at The fourth bird's nice solution, which uses the \K reset and a positive lookahead at the end.
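As a rough illustration of the capturing-group variant, here is a small Python sketch. The helper name find_paragraph, its ignore_case parameter, and the sample text are made up for this example; it also assumes the phrase starts and ends with a word character, otherwise the \b anchors will not behave as intended.

```python
import re

def find_paragraph(text, phrase, ignore_case=False):
    """Return the @@-delimited paragraph containing phrase, or None.

    Hypothetical helper: group 1 holds the paragraph without the
    delimiters or the surrounding whitespace.
    """
    flags = re.S | (re.I if ignore_case else 0)
    # @@\s*            opening delimiter and optional whitespace
    # (?:.(?!@@))*?    chars not directly followed by @@, lazily
    # \b...\b          the escaped phrase between word boundaries
    # .*?\s*@@         rest of the paragraph up to the closing delimiter
    pattern = r"@@\s*((?:.(?!@@))*?\b" + re.escape(phrase) + r"\b.*?)\s*@@"
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None

# Made-up sample: a single @ inside a paragraph does not end it.
sample = (
    "@@Contact support@example.com for help.@@ "
    "Manage accounts from One Central Source today. @@ "
    "Last paragraph.@@"
)

print(find_paragraph(sample, "one central source", ignore_case=True))
print(find_paragraph(sample, "for help"))
```

Without ignore_case=True the first call returns None, which is the case-sensitivity caveat mentioned above.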