I have strings of this form:
A_B_CDEF_GHI
A_B_C_DEF_G_H_I
ABC_D_E_F_GHI
ABCDEFG_H_I
A_B_C
I need to convert those to the following:
AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC
So the rules are:
- (._){2,} should be converted to XXX_ if it's not at the end of the string.
- If (_.){2,} occurs at the end of a string, it should be converted to _XXX.
- If (_.){2,}. is the entire string, all underscores should be removed.
I've gotten to (((.)_){2,}), which does match the first rule, but how can I replace it with the non-underscore characters it found?
The python tag is present because that's where the code is, and I know regex dialects depend on the language.
CodePudding user response:
The dot in your example code matches any character including an underscore. You can make the pattern a bit more specific instead.
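To see that first point concretely, here is a minimal sketch. The two patterns are simplified stand-ins rather than the exact pattern from the question, and the input string is made up for illustration.

```python
import re

# Simplified form of the question's pattern: "." also accepts "_".
dot_pattern = re.compile(r"(._){2,}")
# Same shape, but restricted to uppercase letters.
letter_pattern = re.compile(r"([A-Z]_){2,}")

only_underscores = "____"  # made-up input containing no letters at all

print(bool(dot_pattern.fullmatch(only_underscores)))     # True
print(bool(letter_pattern.fullmatch(only_underscores)))  # False
```

The dot version accepts a string made only of underscores, which is why restricting the character class helps.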
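First match every run of two or more A-Z characters, together with its optional surrounding underscores, so that it is consumed and left unchanged. Then capture in a group a single A-Z followed by one or more repetitions of _ and a single A-Z.
For matches where the capture group is set, replace each _ in the group with an empty string.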
_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)
- _?[A-Z]{2,}_? Match 2 or more occurrences of A-Z surrounded by optional underscores
- | Or
- ( Capture group 1
  - [A-Z] Match a single A-Z
  - (?:_[A-Z](?![A-Z]))+ Repeat 1+ times: _ and a single A-Z, asserting that no A-Z follows to the right
- ) Close group 1
For example:
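import re

# Runs of 2+ letters are matched as-is; group 1 captures single letters joined by "_"
pattern = r'_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)'

s = ("A_B_CDEF_GHI\n"
     "A_B_C_DEF_G_H_I\n"
     "ABC_D_E_F_GHI\n"
     "ABCDEFG_H_I\n"
     "A_B_C")

# Strip the underscores only when group 1 took part in the match
res = re.sub(pattern, lambda x: x.group(1).replace("_", "") if x.group(1) else x.group(), s)
print(res)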
Output
AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC
For a slightly broader match than A-Z, you could use a negated character class that matches any character except whitespace or an underscore:
_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)
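As a rough sketch of how the broader pattern behaves on input that is not uppercase-only: the helper name `collapse` and the lowercase/digit sample string are assumptions made for illustration, not part of the question.

```python
import re

# Broader variant: any character except "_" or whitespace counts as a "letter".
pattern = re.compile(r"_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)")


def collapse(s: str) -> str:
    # Hypothetical helper: remove "_" only inside captured runs of single characters;
    # matches of the first alternative (group 1 unset) are returned unchanged.
    return pattern.sub(
        lambda m: m.group(1).replace("_", "") if m.group(1) else m.group(), s
    )


print(collapse("a_b_cdef_1_2"))   # ab_cdef_12
print(collapse("A_B_CDEF_GHI"))   # AB_CDEF_GHI
```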
CodePudding user response:
Here's a solution without regular expressions:
from typing import Iterator


def convert(s: str) -> str:
    """ https://stackoverflow.com/q/71578300 """

    def _get_combined_parts() -> Iterator[str]:
        """
        yields the ``_``-separated parts of ``s``
        where subsequent single-character parts have been combined
        """
        combined_part = ""
        for part in s.split("_"):
            if len(part) <= 1:
                # accumulate consecutive single-character parts
                combined_part += part
            else:
                if combined_part:
                    yield combined_part
                yield part
                combined_part = ""
        if combined_part:
            yield combined_part

    return "_".join(_get_combined_parts())
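The same split-and-combine idea can also be written with itertools.groupby. This is a sketch of an alternative formulation, not the code from the answer above; the name `convert_grouped` is made up here, and it is checked only against the five inputs from the question.

```python
from itertools import groupby


def convert_grouped(s: str) -> str:
    # Hypothetical variant: group consecutive "_"-separated parts by whether
    # they are at most one character long, then join each short group together.
    out = []
    for is_short, group in groupby(s.split("_"), key=lambda p: len(p) <= 1):
        parts = list(group)
        if is_short:
            joined = "".join(parts)
            if joined:  # skip groups consisting only of empty parts
                out.append(joined)
        else:
            out.extend(parts)
    return "_".join(out)


examples = {
    "A_B_CDEF_GHI": "AB_CDEF_GHI",
    "A_B_C_DEF_G_H_I": "ABC_DEF_GHI",
    "ABC_D_E_F_GHI": "ABC_DEF_GHI",
    "ABCDEFG_H_I": "ABCDEFG_HI",
    "A_B_C": "ABC",
}
for source, expected in examples.items():
    print(source, "->", convert_grouped(source), convert_grouped(source) == expected)
```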