I am trying to build a list of the blocks of text that sit between consecutive occurrences of '\n\n' in the text below: everything up to a '\n\n', then everything up to the next '\n\n', and so on. I cannot figure out the correct regex. I tried a positive lookbehind because I do not need the first block, "-DOCSTART- -X- O O", in the result. The text also contains single '\n' characters, which I think is what is tripping me up: I want to keep those and split only on '\n\n'. If a positive lookbehind is not the most efficient approach, I am open to other solutions; it is just what seemed best to me so far. Any suggestions?
Regex I am attempting: (?<=\\n\\n).
Text sample:
-DOCSTART- -X- O O\n\nEU NNP I-NP B-ORG\nrejects VBZ I-VP O\nGerman JJ I-NP B-MISC\ncall NN I-NP O\nto TO I-VP O\nboycott VB I-VP O\nBritish JJ I-NP B-MISC\nlamb NN I-NP O\n. . O O\n\nPeter NNP I-NP B-PER\nBlackburn NNP I-NP I-PER\n\nEU NNP I-NP B-ORG\nrejects VBZ I-VP O\nGerman JJ I-NP B-MISC\ncall NN I-NP O\nto TO I-VP O\nboycott VB I-VP O\nBritish JJ I-NP B-MISC\nlamb NN I-NP O\n. . O O\n\nPeter NNP I-NP B-PER\nBlackburn NNP I-NP I-PER\n\n
CodePudding user response:
This regex works: re.findall(r'(?<=\n\n)(.+?)(?=\n\n)', text, re.S). The re.S flag is important because it allows the dot to match \n, so a single match can span several lines. Note that text.split('\n\n'), suggested by Wiktor, also works and is simpler.
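To compare the two approaches, here is a minimal sketch. It assumes the text holds real newline characters rather than a literal backslash followed by n, and it uses a shortened version of the sample. The variable names are only illustrative. Note that split() also returns the leading "-DOCSTART- -X- O O" entry and a trailing empty string, so the sketch drops both.

```python
import re

# Shortened version of the sample from the question.
# Assumption: the text contains real newline characters.
text = (
    "-DOCSTART- -X- O O\n\n"
    "EU NNP I-NP B-ORG\nrejects VBZ I-VP O\n. . O O\n\n"
    "Peter NNP I-NP B-PER\nBlackburn NNP I-NP I-PER\n\n"
)

# Lookaround version: each match starts right after a blank line and
# lazily extends to just before the next one. re.S lets '.' match '\n'.
blocks_regex = re.findall(r"(?<=\n\n)(.+?)(?=\n\n)", text, re.S)

# split() version: skip the first piece (the DOCSTART header) and
# discard the empty string produced by the trailing '\n\n'.
blocks_split = [block for block in text.split("\n\n")[1:] if block]

# Single '\n' characters are preserved inside each block, so a block
# can be broken into its individual lines afterwards if needed.
lines_per_block = [block.split("\n") for block in blocks_split]

print(blocks_regex == blocks_split)
print(lines_per_block)
```

Both variants give the same list of blocks for this sample. The split() version is easier to read, while the regex version skips the header without any slicing.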