Exclude some characters from a regex group

Time:12-31

I have a text that contains many articles concatenated into a single string. Each new article starts with = Article 1 = followed by = = Article 1 Section 1 = =, = = Article 1 Section 2 = = and so on. I want to split this string and create a string for each article.

For that I am using a regex split:

import re
pattern = r"=[\s\w\'\(\)] ="
l = re.compile(pattern).split(test_data)

But this isn't giving me the desired result: the text is split on section and subsection headings as well. I tried to exclude runs of multiple = from matching, but had no success and am not sure how to proceed. I have pasted sample data (two articles) here - Robert Boulder and Kiss You ( One Direction song )

CodePudding user response:

This regex should do the job:

^ *= [^=]* = *$

See it working here:

https://regex101.com/r/l3tziI/1

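It matches a '=' followed by a space, any number of characters that are NOT '=' (the [^=] part), then another space and another '='. Because [^=] cannot match the inner '=' of a section heading such as = = Article 1 Section 1 = =, only the single-= article headings match. It also allows optional spaces at the start and end of the line, because your sample text has leading and trailing spaces on some lines. Note that ^ and $ only match at the start and end of each line when the multiline flag is on, so in Python you need to compile the pattern with re.MULTILINE.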
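
As a rough sketch of how this could be plugged into the Python code from the question: the test_data string below is made-up sample text that only imitates the layout described (it is not the asker's real data), and the names heading, articles and titles are just illustrative.

```python
import re

# Hypothetical sample text imitating the layout described in the question.
test_data = (
    " = Robert Boulder = \n"
    " Robert Boulder is an actor . \n"
    " = = Career = = \n"
    " He appeared in a play . \n"
    " = Kiss You ( One Direction song ) = \n"
    " Kiss You is a song . \n"
    " = = Background = = \n"
    " It was released in 2012 . \n"
)

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# so only whole lines of the form "= Title =" count as article headings.
heading = re.compile(r"^ *= [^=]* = *$", re.MULTILINE)

# Split on article headings; drop the empty piece before the first heading.
articles = [part.strip() for part in heading.split(test_data) if part.strip()]

# The headings are consumed by split(), so collect the titles separately.
titles = [m.group().strip(" =") for m in heading.finditer(test_data)]

for title, body in zip(titles, articles):
    print(title, "->", len(body), "characters")
```

Section headings such as = = Career = = stay inside the article text, since the pattern never matches a line with doubled '='. If you want to keep each title together with its body, pairing titles and articles with zip as above is one option.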
