Regex to split on two or more instances of a punctuation mark, and only one or more of the others

Time:11-21

I am new to regex and I am trying to write one (Python flavour) that splits at every punctuation mark or whitespace character, except for a single hyphen (e.g. 9-5 and Mon-Fri would not be split). However, the text that I want to process sometimes contains a sequence of hyphens like -------------, used for separating paragraphs or thematically distinct sections of the document. Therefore, I want a solution that splits on one or more occurrences of every punctuation mark except the hyphen, and that also splits on a run of two or more hyphens.

I have tried with the following code:

re.split(r"[-{2,}\.,:\s]", mystring)

but the -{2,} part gets interpreted literally. I have also tried to incorporate it into a group, but again, the parentheses are interpreted literally. I am aware that I could write a first regex to replace multiple hyphens with the null character, and a second regex that looks at all other whitespace and punctuation marks; however, I am wondering if there is a way to do it in a single step.

Answer:

Most characters inside a character class [...] are literals. The exceptions are the hyphen in certain contexts (between two characters it forms a range), the backslash, a leading ^ for negation, and the closing ] (plus / in regex flavors that use it as a delimiter). So [-{2,}\.,:\s] matches any single one of the characters -, {, 2, the comma, }, ., :, or a whitespace character; only \s keeps a special meaning. Quantifiers such as {2,} and grouping parentheses do not work inside a character class.
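To make this concrete, here is a small sketch that probes the pattern from the question. The test strings are made up for illustration; the point is only that each character in the class matches on its own, including a single hyphen.

```python
import re

# The pattern from the question: inside [...] only \s keeps a special meaning.
pattern = re.compile(r"[-{2,}\.,:\s]")

# Every one of these single characters is matched by the class.
literal_hits = [ch for ch in "-{2,}.: " if pattern.fullmatch(ch)]
print(literal_hits)

# As a consequence, a single hyphen acts as a separator too.
print(pattern.split("Mon-Fri, 9-5"))  # ['Mon', 'Fri', '', '9', '5']
```

The empty string in the output comes from the comma and the space being two adjacent separators.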

I think you are looking for alternation:

[,.\/]|-{2,}
 ^            add whatever punctuation you want to split on

(Python has no regex literal delimiters, so / does not need escaping inside a character class: [,./]|-{2,})
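A minimal sketch of how the alternation could be used with re.split in Python. The exact set of punctuation marks, the + quantifier on the class (to treat runs of separators as one), and the helper name split_text are assumptions for illustration, not part of the original answer.

```python
import re

# Assumed separators: whitespace and . , : ; ! ? / (one or more),
# or a run of two or more hyphens. A single hyphen is left alone.
SPLIT_RE = re.compile(r"[\s.,:;!?/]+|-{2,}")

def split_text(text):
    # Filter out empty strings that appear when separators sit at the
    # start or end of the text, or when the two alternatives are adjacent.
    return [piece for piece in SPLIT_RE.split(text) if piece]

sample = "Open Mon-Fri, 9-5.\n-------------\nClosed: weekends"
print(split_text(sample))  # ['Open', 'Mon-Fri', '9-5', 'Closed', 'weekends']
```

Note that a lone hyphen surrounded by spaces (as in "a - b") would survive as its own token with this pattern; whether that is desirable depends on the input.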
