How to capture everything between a '\n\n' in regex in python?

Time:10-02

I am trying to compile a list of all lines between each pair of '\n\n' in the text below: all characters up to a '\n\n', then again up to the next '\n\n', and so on. I cannot figure out the correct regex for this situation. I attempted a positive lookbehind because I do not need the first segment, "-DOCSTART- -X- O O", included. The text also contains single '\n' characters, which I think is what is tripping me up. I want to keep the single '\n' occurrences and split only on '\n\n'. If a positive lookbehind is not the most efficient approach, I am open to other solutions; it is only what I thought was best at the moment. Any suggestions?

Regex I am attempting: (?<=\\n\\n).

Text sample:

-DOCSTART- -X- O O\n\nEU NNP I-NP B-ORG\nrejects VBZ I-VP O\nGerman JJ I-NP B-MISC\ncall NN I-NP O\nto TO I-VP O\nboycott VB I-VP O\nBritish JJ I-NP B-MISC\nlamb NN I-NP O\n. . O O\n\nPeter NNP I-NP B-PER\nBlackburn NNP I-NP I-PER\n\nEU NNP I-NP B-ORG\nrejects VBZ I-VP O\nGerman JJ I-NP B-MISC\ncall NN I-NP O\nto TO I-VP O\nboycott VB I-VP O\nBritish JJ I-NP B-MISC\nlamb NN I-NP O\n. . O O\n\nPeter NNP I-NP B-PER\nBlackburn NNP I-NP I-PER\n\n

CodePudding user response:

This regex works: re.findall(r'(?<=\n\n)(.+?)(?=\n\n)', text, re.S).

The re.S flag is important because it allows \n to be matched by the dot.
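To make the role of re.S concrete, here is a minimal sketch. It assumes the sample contains real newline characters rather than literal backslash-n sequences, and it uses a shortened copy of the text from the question. The pattern shown, a lookbehind and a lookahead around a lazy group, is one way to write this and is an assumption of the sketch.

```python
import re

# Shortened version of the sample from the question, with real newlines.
text = (
    "-DOCSTART- -X- O O\n\n"
    "EU NNP I-NP B-ORG\nrejects VBZ I-VP O\n. . O O\n\n"
    "Peter NNP I-NP B-PER\nBlackburn NNP I-NP I-PER\n\n"
)

# (?<=\n\n)  start right after a blank-line separator
# (.+?)      capture as little as possible; re.S lets '.' cross single '\n'
# (?=\n\n)   stop right before the next separator without consuming it
blocks = re.findall(r"(?<=\n\n)(.+?)(?=\n\n)", text, re.S)

for block in blocks:
    print(repr(block))
```

Because the lookbehind requires a preceding '\n\n', the "-DOCSTART- -X- O O" header is skipped. Without re.S, the dot cannot cross the single '\n' inside each block, so multi-line blocks like these are not captured.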

Note that text.split('\n\n'), suggested by Wiktor, also works and is simpler, although its result also includes the leading "-DOCSTART- -X- O O" segment.
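The split-based approach can be sketched as follows, again assuming real newline characters and a shortened copy of the sample. Dropping the first element and filtering out empty strings are choices made for this sketch, not something the answer spells out.

```python
# Shortened version of the sample from the question, with real newlines.
text = (
    "-DOCSTART- -X- O O\n\n"
    "EU NNP I-NP B-ORG\nrejects VBZ I-VP O\n. . O O\n\n"
    "Peter NNP I-NP B-PER\nBlackburn NNP I-NP I-PER\n\n"
)

parts = text.split("\n\n")

# parts[0] is the "-DOCSTART-" header, and the trailing '\n\n' leaves an
# empty string at the end, so skip the first item and drop empty entries.
blocks = [part for part in parts[1:] if part]

# Each block still contains single '\n' characters; split again if the
# individual lines are needed.
lines_per_block = [block.split("\n") for block in blocks]

print(lines_per_block)
```

This avoids regex entirely, which is usually easier to read when the separator is a fixed string.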
