Overlapping regular expression substitution in Python, but contingent on values of capture groups

Time:02-25

I'm currently writing a program in Python that is supposed to transliterate all the characters in a language from one orthography into another. There are two steps here: the first is already solved, and the second is the problem. In the first step, characters from the source orthography are converted into the target orthography, e.g.

š -> sh

ł -> lh

m̓ -> m’

l̓ -> l’

(For reference: the apostrophe-looking character is a single closing quotation mark.)
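For illustration only, the first step could be sketched with a translation table. This is not the code from the question (which is not shown); the names FIRST_STAGE and first_stage are hypothetical. It assumes the glottal mark in the source text is U+0313 (combining comma above), that š and ł arrive as single precomposed characters, and that ə maps to u, which is inferred from the examples further down rather than stated outright.

```python
# Hypothetical first-stage table. Only the four mappings listed above are
# taken from the question; the schwa mapping is an assumption inferred
# from its examples.
FIRST_STAGE = str.maketrans({
    "\u0161": "sh",      # š -> sh
    "\u0142": "lh",      # ł -> lh
    "\u0313": "\u2019",  # combining comma above -> ’ (covers m̓, l̓, ...)
    "\u0259": "u",       # ə -> u (assumed)
})

def first_stage(text):
    """Apply the character-for-character first stage of transliteration."""
    return text.translate(FIRST_STAGE)

print(first_stage("\u0259m\u0313i"))  # um’i
```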

Getting closer to the problem: certain graphemes that are normally written in a standard way (m’, n’, l’, y’, w’) are written differently depending on what immediately surrounds them. Specifically, the ’ character moves to precede the consonant character if the grapheme sits directly between two vowels and the vowel that precedes the grapheme is at a higher 'level' in a hierarchy than the vowel that follows it. It's a complicated rule to explain, so here are some examples, which include the first stage of the transliteration:

əm̓ə -> um’u -> um’u (no change)
əm̓i -> um’i -> um’i (no change)
im̓ə -> im’u -> i’mu (’ character moves to precede; i > u)
em̓i -> em’i -> e’mi (’ character moves to precede; e > i)

The ’ character moves to the preceding vowel only when that vowel ranks higher than the following one in this hierarchy: e > i > a > u

Here is the code that I have that deals with this second step pretty much as well as it can be done. It is pretty clean and takes care of the problem succinctly:

import re

def glottalized_resonant_mover(linestring):
    
    '''
    moves glottal character over according to glottalized resonant 
    hierarchy:

    case description: VR’W for some vowels V, W; some glottalized 
    resonant R’

    hierarchy: e > i > a > u
               3 > 2 > 1 > 0

    if h(V) > h(W), then string is V’RW
    
    '''

    hi_scores = {'e' : 3,
                'i' : 2,
                'a' : 1,
                'u' : 0}

    def hierarchy_sub(matchobj):
        '''moves glottalized resonant if a vowel pulls it one way
        or the other
        '''

        if hi_scores[matchobj.group(1)] > hi_scores[matchobj.group(4)]:

            swap_string = ''.join(
                [
                matchobj.group(1),
                matchobj.group(3),
                matchobj.group(2),
                matchobj.group(4)
                ]
            )
            return swap_string

        else:
            return matchobj.group(0)


    glot_res_re = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)')
    swapstring = glot_res_re.sub(hierarchy_sub, linestring)
    
    return swapstring

sample = ['’im’ush', 'ttham’uqwus', 'xwtsekwul’im’us']

for i in sample:
    print(glottalized_resonant_mover(i))

So, when this code is given the transliterated words ’im’ush, ttham’uqwus, and xwtsekwul’im’us, it works perfectly for the first two words, but not the third. Summarized:

’im’ush          -> ’i’mush √
ttham’uqwus      -> ttha’muqwus √
xwtsekwul’im’us  -> xwtsekwul’im’us X should be: xwtsekwul’i’mus

The problem is that there are two overlapping candidate matches in the third word: ul’i and im’u, which share the i. Once the regex has consumed ul’i, the i is no longer available to start the second match.
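The overlap can be seen directly with re.finditer. This is a minimal sketch that reuses the pattern from the function above; the names pattern, word, and found are only for illustration.

```python
import re

# Same pattern as glot_res_re in the function above.
pattern = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)')

word = 'xwtsekwul’im’us'

# finditer returns non-overlapping matches, scanning left to right.
found = [m.group(0) for m in pattern.finditer(word)]
print(found)  # ['ul’i']
```

Only ul’i is reported: the scan resumes after the i, so im’u is never tried.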

Now, this program is being fed lines of text, where the first stage of transliteration occurs, and then this second step should occur. Some documents are thousands of lines long, and there's a lot of these documents. There are also other things that I mean to implement (checking against wordlists, etc.) that will take up much computational power, so I'd like to keep this as quick as possible while still being comprehensible.

Also, it is true that I could just write a sequence for each and just have another big list of character sequences to replace, but then I lose some of the portability as well as the ability to easily make edits later.

So, if there's supposed to be a question: what is the best way to solve this problem that still preserves the approach and some of the qualities of my original solution?

CodePudding user response:

I'm not really clear on what you're after here. In particular, your code apparently doesn't always do what you want it to do ("it works perfectly for the first two words, but not the third"), but you haven't asked for a solution to that, nor given us enough information to know why the third word "is wrong".

So I'll just make stuff up ;-) Since re.sub() doesn't know about overlaps, I'd match multiple times in "priority" order, looking only for things of the form

(e)([lmnwy])(’)([iau])

first. A sequence of 3 similar patterns appears to capture all the rules you gave us, and they only match when something is in fact in need of swapping. Notes:

  • You don't have to write these "by hand". The code below constructs them.

  • Don't stress about speed. It's far too early for that, and "thousands of lines" should in fact be trivial to process on modern boxes. This is faster than you're guessing anyway, since the expense of calling the substitution function is never incurred unless a substitution needs to be made.

EDIT: since replacements are unconditional in this way, I changed this to use a fixed replacement string template instead of calling a substitution function with a match object argument.

import re

R = "lmnwy"
V_in_order = "eiau"
pats = []
# One pattern per vowel: that vowel, a resonant, ’, then any lower-ranked vowel.
for i, vowel in enumerate(V_in_order[:-1]):
    pat = f"({vowel})([{R}])(’)([{V_in_order[i + 1:]}])"
    print("pattern", repr(pat))
    pats.append(re.compile(pat))

# Not needed!
# def sub(m):
#    return m.group(1) + m.group(3) + m.group(2) + m.group(4)

def glottalized_resonant_mover(s):
    for p in pats:
        s = p.sub(r'\1\3\2\4', s)
    return s

sample = ['’im’ush', 'ttham’uqwus', 'xwtsekwul’im’us']
for i in sample:
    print(i, "->", glottalized_resonant_mover(i))

With output that appears to match what you want in all cases:

pattern '(e)([lmnwy])(’)([iau])'
pattern '(i)([lmnwy])(’)([au])'
pattern '(a)([lmnwy])(’)([u])'
’im’ush -> ’i’mush
ttham’uqwus -> ttha’muqwus
xwtsekwul’im’us -> xwtsekwul’i’mus
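As a further check, the same approach can be run against the four rule examples from the question (um’u, um’i, im’u, em’i). This is a self-contained sketch that rebuilds the patterns; the examples dict is just a convenient way to hold the expected results and is not part of the original answer.

```python
import re

R = "lmnwy"
V_in_order = "eiau"

# One pattern per vowel, each followed only by lower-ranked vowels.
pats = [
    re.compile(f"({vowel})([{R}])(’)([{V_in_order[i + 1:]}])")
    for i, vowel in enumerate(V_in_order[:-1])
]

def glottalized_resonant_mover(s):
    for p in pats:
        s = p.sub(r"\1\3\2\4", s)
    return s

# Second-stage input -> expected output, taken from the question's examples.
examples = {"um’u": "um’u", "um’i": "um’i", "im’u": "i’mu", "em’i": "e’mi"}
for source, expected in examples.items():
    assert glottalized_resonant_mover(source) == expected
```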

CodePudding user response:

Yes, only very small changes are needed.

  VR’W -> V’RW

In fact, only the first three characters need to be rearranged, with W serving as a necessary condition, so the problem we have to solve becomes:

  VR’(W) -> V’R

A lookahead assertion, (?=...), can match VR’(W) without consuming W.

Before: VR’W

(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)

After: the pattern consumes only three characters but looks ahead one character for W: VR’(W)

(a|e|i|u)(l|m|n|w|y)(’)(?=(a|e|i|u))

Since W is only a condition and is not part of the consumed match, it can be matched again as the V of the next sequence.
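A minimal sketch of the effect, using re.finditer on the third sample word; the names lookahead_re, word, and found are only for illustration.

```python
import re

# The fourth group sits inside a lookahead, so it is captured but not consumed.
lookahead_re = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(?=(a|e|i|u))')

word = 'xwtsekwul’im’us'

# Each entry pairs the consumed text with the vowel seen by the lookahead.
found = [(m.group(0), m.group(4)) for m in lookahead_re.finditer(word)]
print(found)  # [('ul’', 'i'), ('im’', 'u')]
```

Both sequences are now found, because the i that closes ul’i is still available to open im’u.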

import re

def glottalized_resonant_mover(linestring):
    
    '''
    moves glottal character over according to glottalized resonant 
    hierarchy:

    case description: VR’W for some vowels V, W; some glottalized 
    resonant R’

    hierarchy: e > i > a > u
               3 > 2 > 1 > 0

    if h(V) > h(W), then string is V’RW
    
    '''

    hi_scores = {'e' : 3,
                'i' : 2,
                'a' : 1,
                'u' : 0}

    def hierarchy_sub(matchobj):
        '''moves glottalized resonant if a vowel pulls it one way
        or the other
        '''
        if hi_scores[matchobj.group(1)] > hi_scores[matchobj.group(4)]:

            swap_string = ''.join(
                [
                matchobj.group(1),
                matchobj.group(3),
                matchobj.group(2),
                #matchobj.group(4) <- Don't need the last one because 'lookahead'
                ]
            )
            return swap_string

        else:
            return matchobj.group(0)
       
    glot_res_re = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(?=(a|e|i|u))')
    # glot_res_re = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)')
    swapstring = glot_res_re.sub( hierarchy_sub, linestring)
    
    return swapstring

sample = ['’im’ush', 'ttham’uqwus', 'xwtsekwul’im’us']
answer = ['’i’mush', 'ttha’muqwus', 'xwtsekwul’i’mus']
# Print each input, the function's result, and the expected answer side by side.
for word, expected in zip(sample, answer):
    print(word, '->', glottalized_resonant_mover(word), "==", expected)

Output:

’im’ush -> ’i’mush == ’i’mush
ttham’uqwus -> ttha’muqwus == ttha’muqwus
xwtsekwul’im’us -> xwtsekwul’i’mus == xwtsekwul’i’mus