I am trying to extract the text between two keywords. The pattern below repeats throughout a text document, with varying content between the start keyword and the end keyword. The document also contains paragraphs and other text before and after these blocks, which I don't want to extract. Can anyone help me with a regex that would extract all occurrences?
Start keyword: RIASWIX. End keyword: Sky Access.
----Document Start-------
Paragraph*
RIASWIX.* ABCDEF1 NONE
WORKING: HELLO(READ)
BOOLEAN Access: SADGRE3, VJFKES3, JGJKEWW, IS4DWF44(A), DFEAWE2(G),
DW4444W, IHFK3MF3
BAZAAR Access: No resource with BAZAAR Access
GHAR Access: No resource with GHAR Access
WATER Access: ADMINDDD(A), GEDDE33
SKY None: No Resource with Sky Access
RIASWIX.@7483NFJ.* HFDFDF3 NONE
WORKING: BYE(READ)
BOOLEAN Access: GRREGGG, GREFEFF, GFGGGG, FDFDFDF(A), RERERE3(G),
GFFWEF44, FFRF44F
BAZAAR Access: No resource with BAZAAR Access
GHAR Access: No resource with GHAR Access
WATER Access: ADMINEWW(A), FFRFRGR
SKY None: No Resource with Sky Access
RIASWIX.@7483KXX.* HFDFDF3 NONE
WORKING: TATA(READ)
BOOLEAN Access: GRDSD33, FASDE, GFGGGG, RWERW33(A), NMUYHT4(G),
BAZAAR Access: XCDFEFE3, FREFE33R
GHAR Access: No resource with GHAR Access
WATER Access: DASDEFG(A), SJMFEIOE(P)
SKY None: No Resource with Sky Access
*Text
----Document End-------
CodePudding user response:
Use the (?s) flag (DOTALL) so that . also matches newline characters; see the related question "Regex match all characters between two strings".
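import re
# str1 holds the document text. The inline flag (?s) must come at the start
# of the pattern; Python 3.11+ rejects global flags placed mid-pattern.
print(re.findall(r'(?s)RIASWIX(.*?)Sky Access', str1))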
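As a rough, self-contained illustration of the same idea, the sketch below runs the pattern against a shortened sample shaped like the document in the question. The variable names (text, pattern, blocks) and the trimmed sample are assumptions made for the example; re.DOTALL is the flag equivalent of (?s).

```python
import re

# Shortened, hypothetical sample shaped like the document in the question.
text = """Paragraph before
RIASWIX.* ABCDEF1 NONE
WORKING: HELLO(READ)
WATER Access: ADMINDDD(A), GEDDE33
SKY None: No Resource with Sky Access
RIASWIX.@7483NFJ.* HFDFDF3 NONE
WORKING: BYE(READ)
WATER Access: ADMINEWW(A), FFRFRGR
SKY None: No Resource with Sky Access
Text after
"""

# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the first
# "Sky Access" that follows each "RIASWIX".
pattern = re.compile(r"RIASWIX(.*?)Sky Access", re.DOTALL)

# findall returns only the captured group, i.e. the text between the keywords.
blocks = pattern.findall(text)
for block in blocks:
    print(repr(block))
```

Note that the match is case-sensitive, so "SKY None" and "WATER Access" do not end a block; only the literal "Sky Access" does.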
CodePudding user response:
Your question is tagged with both Python and Java. I can answer for Java.
To implement your regex you need to:
Use a positive lookbehind and a positive lookahead so that the keywords delimit each occurrence but are excluded from the matched text.
Then, you should use a reluctant quantifier to only match the text in between a pair of keywords, or else you would match the whole text between the first and last keyword.
Finally, your regex should enable the DOTALL flag in order to match the text across multiple lines.
Here is an implementation with your example (regex demo: https://regex101.com/r/6Lnm5i/1):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "... your text to parse ....";
// Compile the regex with DOTALL enabled. Alternatively, put the inline flag (?s) at the start of the pattern.
Pattern regex = Pattern.compile("(?<=RIASWIX).*?(?=Sky Access)", Pattern.DOTALL);
// Create a matcher for the regex over the text to parse
Matcher matcher = regex.matcher(text);
// Loop while there are still occurrences
while (matcher.find()) {
    // Print the current occurrence
    System.out.println(matcher.group());
}
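Since the question is also tagged Python, here is a minimal sketch of the same lookaround pattern there, contrasting the reluctant quantifier with a greedy one. The sample string and the names lazy and greedy are made up for the illustration.

```python
import re

# Hypothetical two-occurrence sample, kept short to make the contrast visible.
text = "intro RIASWIX one Sky Access\nmiddle RIASWIX two Sky Access outro"

# Reluctant ".*?": each match ends at the nearest "Sky Access".
lazy = re.findall(r"(?<=RIASWIX).*?(?=Sky Access)", text, flags=re.DOTALL)

# Greedy ".*": a single match runs from the first "RIASWIX" to the last "Sky Access".
greedy = re.findall(r"(?<=RIASWIX).*(?=Sky Access)", text, flags=re.DOTALL)

print(lazy)
print(greedy)
```

The lookbehind works here because RIASWIX has a fixed width, which Python's re module requires for lookbehind assertions.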