I am new to regex and I am trying to write one (Python flavour) that would allow me to split at every punctuation mark or whitespace, except for the single hyphen (e.g. 9-5 and Mon-Fri would not be split). However, the text that I want to process sometimes contains a sequence of hyphens like -------------, used for separating paragraphs or thematically distinct sections of the document. Therefore, I want a solution that splits on one or more occurrences of every punctuation mark except the hyphen, and that splits on a combination of 2 or more hyphens.
I have tried with the following code:
re.split(r"[-{2,}\.,:\s]", mystring)
but the -{2,} part gets interpreted literally. I have also tried to incorporate it into a group, but again, the parentheses are interpreted literally.
I am aware that I could write a first regex to replace multiple hyphens with the null character, and a second regex that looks at all other whitespace and punctuation marks; however, I am wondering if there is a way to do it in a single step.
CodePudding user response:
Most characters inside a character class [...] are literals, EXCEPT a hyphen in certain contexts, a backslash, a closing ] and a leading ^ (and / in some regex flavors...). So in [-{2,}\.,:\s] everything is matched as a literal character except for \s: the class matches one character that is -, {, 2, a comma, }, a dot, a colon or whitespace. Most regex metacharacters, including quantifiers such as {2,} and grouping parentheses, no longer work inside a character class.
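A quick check in Python illustrates this. It is only a sketch: the sample strings are made up for illustration, and the pattern is the one from the question.

```python
import re

# The pattern from the question: everything between [ and ] is ONE character class.
question_pattern = r"[-{2,}\.,:\s]"

# A single hyphen is a member of the class, so the string is split on it.
hyphen_result = re.split(question_pattern, "Mon-Fri")

# The characters {, 2 and } are ordinary members of the class as well.
brace_result = re.split(question_pattern, "a{b2c}d")

print(hyphen_result)  # ['Mon', 'Fri']
print(brace_result)   # ['a', 'b', 'c', 'd']
```

Note that even the digit 2 acts as a separator here, which is probably not what was intended.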
I think you might be looking for alternation:
[,.\/]|-{2,}
^ add whatever punctuation you want to split on
(In Python, which has no /.../ delimiters around a regex, you can use / inside a character class without escaping it: [,./]|-{2,})
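Applied to the question, the alternation approach might look like the sketch below. The punctuation set (. , : and whitespace) is taken from the asker's attempt; the helper name split_text, the sample string and the decision to drop empty strings are assumptions made for illustration.

```python
import re

# Alternation inside a non-capturing group: either a run of 2+ hyphens
# or one punctuation/whitespace character. The trailing + merges
# adjacent separators into a single split point.
SPLIT_PATTERN = re.compile(r"(?:-{2,}|[.,:\s])+")

def split_text(text):
    # Filter out the empty strings that appear when the text
    # starts or ends with a separator.
    return [token for token in SPLIT_PATTERN.split(text) if token]

sample = "Open Mon-Fri, 9-5. ------------- Closed: weekends"
print(split_text(sample))
# ['Open', 'Mon-Fri', '9-5', 'Closed', 'weekends']
```

The single hyphens in Mon-Fri and 9-5 survive because -{2,} needs at least two hyphens and the hyphen is not listed in the character class. The group is non-capturing so that re.split does not include the separators in its result.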