I am using Python and have the following regular expression to extract text from text files:
pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|SIGNATURES|SIGNATURE|Pursuant to the requirements of the Securities Exchange Act of 1934)\b)'
pd_00['important_text'] = pd_00['text'].str.extract(pattern, re.IGNORECASE, expand=False)
My issue is specifically with the last term, "Pursuant to the requirements of the Securities Exchange Act of 1934". In the text files, this sentence is sometimes spaced irregularly, with parts of it starting on new lines. How do I account for this variation? Right now the pattern only matches the sentence when it is written on one line with single spaces between words.
CodePudding user response:
import re

s = "Pursuant to the requirements of the Securities Exchange Act of 1934"
reg = re.compile(r'[A-Za-z\s]+.*?1934$')
print(reg.search(s))
<re.Match object; span=(0, 67), match='Pursuant to the requirements of the Securities Ex>
CodePudding user response:
Simply change the spaces in the regex to the whitespace character class:
pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|SIGNATURES|SIGNATURE|' r'Pursuant to the requirements of the Securities Exchange Act of 1934)\b)'.replace(' ', r'\s+')
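As a rough, self-contained check of this idea, here is a minimal sketch using only the re module. The helper name flexible_whitespace and the sample text are made up for illustration and are not from the question. It joins the words of the phrase with \s+ so that any run of spaces or newlines between them is accepted:

```python
import re

# Hypothetical helper (not part of the original code): join the words of a
# phrase with \s+ so the phrase still matches when it is wrapped across
# lines or padded with extra spaces.
def flexible_whitespace(phrase):
    return r"\s+".join(re.escape(word) for word in phrase.split())

END_PHRASE = "Pursuant to the requirements of the Securities Exchange Act of 1934"

pattern = (
    r"\bItem\s+5\.02\s*([\w\W]*?)"
    r"(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|SIGNATURES|SIGNATURE|"
    + flexible_whitespace(END_PHRASE)
    + r")\b)"
)

# Made-up sample text: the closing sentence is broken across lines and
# contains irregular spacing.
sample = (
    "Item 5.02 Departure of Directors.\n"
    "The director resigned.\n\n"
    "Pursuant to the   requirements\nof the Securities\n"
    "  Exchange Act of 1934, the registrant..."
)

match = re.search(pattern, sample, re.IGNORECASE)
extracted = match.group(1) if match else None
print(extracted)
```

The same pattern string should work unchanged in the pd_00['text'].str.extract(pattern, re.IGNORECASE, expand=False) call from the question. Building the phrase from its words, rather than calling .replace on the whole pattern, keeps the substitution limited to that one phrase.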