I have a text that contains many articles concatenated into a single string. Each new article starts with = Article 1 =, followed by = = Article 1 Section 1 = =, = = Article 1 Section 2 = =, and so on. I want to split this string and create a separate string for each article.
For that I am using a regex split:
import re
pattern = "=[\s\w\'\(\)] ="
l = re.compile(pattern).split(test_data)
But this isn't giving me the desired result: the text is being split on sections and subsections as well. I tried excluding runs of multiple =s from the match but had no success, and I'm not sure how to proceed. I have pasted sample data (two articles) here: Robert Boulder and Kiss You ( One Direction song )
CodePudding user response:
This regex should do the job:
^ *= [^=]* = *$
See it working here:
https://regex101.com/r/l3tziI/1
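It matches a '=' followed by a space, any number of characters that are NOT '=' (the [^=] part), then another space and another '='. It also allows optional spaces at the start and end of the line, because your sample text has leading and trailing spaces on some lines. Section headings such as = = Section = = don't match, because the first '= ' is followed directly by another '=' instead of title text. In Python, compile the pattern with re.MULTILINE so that ^ and $ match at every line rather than only at the start and end of the whole string.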
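
To try this in Python, here is a minimal sketch rather than a definitive solution. It takes the regex above and makes two changes of my own: a capture group around the title, so that re.split keeps the titles instead of discarding them, and \n excluded from the character class, so a title cannot run over a line break. The split_articles name and the sample string are made up for illustration; the sample only imitates the layout described in the question.

```python
import re

# Top-level titles look like " = Title = ". Section headings ("= = ... = =")
# fail to match because the first "= " is followed directly by another "=".
# re.MULTILINE makes ^ and $ match at every line, not just the whole string.
TITLE_RE = re.compile(r"^ *= ([^=\n]*) = *$", re.MULTILINE)

def split_articles(text):
    """Return a list of (title, body) tuples, one per top-level article."""
    # With one capture group, re.split returns
    # [text_before_first_title, title1, body1, title2, body2, ...]
    parts = TITLE_RE.split(text)
    titles = parts[1::2]
    bodies = parts[2::2]
    return [(t.strip(), b.strip()) for t, b in zip(titles, bodies)]

# Hypothetical sample data in the same layout as the question's text.
sample = (
    " = Robert Boulder = \n"
    " \n"
    " Placeholder text for the first article . \n"
    " \n"
    " = = Career = = \n"
    " \n"
    " = = = Early work = = = \n"
    " \n"
    " More placeholder text . \n"
    " \n"
    " = Kiss You ( One Direction song ) = \n"
    " \n"
    " Placeholder text for the second article . \n"
    " \n"
    " = = Background = = \n"
)

articles = split_articles(sample)
for title, body in articles:
    print(title, "->", len(body), "characters")
```

Each body still contains its own section and subsection headings, since only the single-'=' title lines act as split points. If you only need the bodies and not the titles, drop the capture group and ignore the first element of the split result.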