PowerShell Regex for German Umlaute based on upper/lowercase and position in String

I am trying to write a script in PowerShell to convert German umlauts:

ä, ö, ü, ß to ae, oe, ue, ss

Ä, Ö, Ü, ß to AE or Ae, OE or Oe, UE or Ue, and SS.

The problem is that I also need to differentiate based on the position of the umlaut.

ÜNLÜ > UENLUE
Ünlü > Uenlue (Ue)
SCHNEEWEIß > SCHNEEWEISS
Schneeweiß > Schneeweiss
Geßl > Gessl
GEßL > GESSL
Josef Öbinger > Josef Oebinger (one string)
Jürgen MÜLLER > Juergen MUELLER (one string)

The main problem ruining my day is the ß.

There is no upper and lower case for ß.

I need to decide how to convert ß based on whether the previous character was uppercase or lowercase.

I have tried various regex like [ÄÖÜßA-Z]{1,}(?![\sa-zäüö])[ÄÖÜßA-Z] or [ÄÖÜß][^a-z]

I cannot figure out how to choose between ss and SS. Apart from that, in words like ÜNLÜ only one of the two umlauts is matched, because the other one is at the end of the word.

I need 3 matching regex patterns: one for uppercase, one for lowercase, and one for mixed case (Oebinger).

Those 3 patterns will then go into 3 if conditions in PowerShell, where I can blindly convert based on the matched pattern.

[ÄÖÜß][^a-z] works for ÜNLÜ > UENLUE

[äöüß][^A-Z] works for Jürgen > Juergen

But the ß in Schneeweiß and SCHNEEWEIß is matched by both patterns. That is not what I want.

I need a pattern that can check whether the letters before and after ß are lowercase or uppercase. If lowercase, then ß = ss; if uppercase, then ß = SS.

The third case, mixed case, does not really require a separate regex. I could take the string Jürgen MÜLLER and run it through both patterns in PowerShell. The first pattern would convert it to Jürgen MUELLER; running that result through the second pattern would give Juergen MUELLER.

The ß is always the same: lowercase = uppercase. This is what makes the whole thing so difficult.

I am losing hope. Please help me guys.

Thank You and Best Regards.

CodePudding user response：

PowerShell (Core) 7+ offers a concise solution, because the -replace operator there accepts a script block as the substitution operand, which enables flexible, dynamic substitutions based on each match found:

$strings = @(
  'ÜNLÜ'          # > UENLUE
  'Ünlü'          # > Uenlue (Ue)
  'SCHNEEWEIß'    # > SCHNEEWEISS
  'Schneeweiß'    # > Schneeweiss
  'Geßl'          # > Gessl
  'GEßL'          # > GESSL
  'Josef Öbinger' # > Josef Oebinger (one string)
  'Jürgen MÜLLER' # > Juergen MUELLER (one string)
)

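$strings `
  -replace '[äöü].?', { 
    # Base letter (a, o, u) + 'e' or 'E', chosen by the case of the following
    # character (or of the umlaut itself at the end of the string).
    ([string] $_.Value[0]).Normalize('FormD')[0] +
      ([char]::IsUpper($_.Value[1] ?? $_.Value[0]) ? 'E' : 'e') +
      $_.Value[1]  # re-append the following character, which the match consumed
  } `
  -replace '.ß', { 
    # Keep the preceding character; its case decides between 'SS' and 'ss'.
    $_.Value[0] + ([char]::IsUpper($_.Value[0]) ? 'SS' : 'ss') 
  }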

Note:

Calling .Normalize('FormD')[0] on a string containing a single umlaut character in effect converts that character to its ASCII base letter; for instance, ü becomes u - see System.String.Normalize.
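To make the note above concrete, here is a minimal sketch of what the decomposition yields for a single character. The variable names are illustrative only, and it assumes the script file is saved in an encoding that PowerShell reads correctly (for example UTF-8 with BOM for Windows PowerShell).

```powershell
# 'ü' (U+00FC) decomposes into the base letter 'u' plus a combining diaeresis (U+0308).
$decomposed = 'ü'.Normalize('FormD')
$decomposed.Length                                                     # 2
$decomposed.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int] $_ }  # U+0075, U+0308

# Indexing with [0] keeps only the base letter.
$baseLetter = $decomposed[0]                                           # u

# ß has no canonical decomposition, so FormD leaves it unchanged.
$sharpS = 'ß'.Normalize('FormD')
$sharpS -ceq 'ß'                                                       # True
```

This is also why ß needs its own replacement step: normalization alone cannot turn it into ss or SS.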

In Windows PowerShell (the legacy, Windows-only edition whose latest and last version is v5.1):

you need to call the underlying .NET API directly, namely [regex]::Replace()
you also need to use if statements in lieu of the ternary operator (<condition> ? <if-true> : <else>) and the null-coalescing operator (??), both of which are only available in PowerShell (Core) 7+.

As a result, the solution is significantly more complex:

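$strings | ForEach-Object {
  # Step 1: ä/ö/ü -> base letter + e/E, decided by the following character.
  $aux = 
    [regex]::Replace(
      $_,
      '[äöü].?',
      { 
        param($m) 
        ([string] $m.Value[0]).Normalize('FormD')[0] +
          $(if ([char]::IsUpper($(if ($m.Value[1]) { $m.Value[1] } else { $m.Value[0] }))) { 'E' } else { 'e' }) +
          $m.Value[1]  # re-append the following character, which the match consumed
      },
      'IgnoreCase'
    )  
  # Step 2: ß -> ss/SS, decided by the preceding character.
  [regex]::Replace(
    $aux,
    '.ß',
    { 
      param($m) 
      $m.Value[0] + $(if ([char]::IsUpper($m.Value[0])) { 'SS' } else { 'ss' }) 
    },
    'IgnoreCase'
  )  
}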

CodePudding user response：

Thanks for such an interesting question!

There are two ways I see you could approach this.

The approach you are currently taking seems to be trying to do this within a replacement string. This may work, though I'd suspect you'd want to use -creplace or an explicitly case-sensitive Regex.
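As a rough sketch of that first approach, the following uses only static replacement strings with -creplace, plus lookarounds to inspect the neighboring letters. The function name Convert-UmlautStatic and the rule "an uppercase umlaut becomes all-caps when an uppercase letter sits next to it" are my own assumptions for illustration, not something taken from the question.

```powershell
function Convert-UmlautStatic {
    param([string] $Text)

    # ß has no case of its own: 'SS' after an uppercase letter, 'ss' otherwise.
    $result = $Text -creplace '(?<=\p{Lu})ß', 'SS' -creplace 'ß', 'ss'

    $pairs = @(
        @{ Upper = 'Ä'; Lower = 'ä'; Base = 'a' }
        @{ Upper = 'Ö'; Lower = 'ö'; Base = 'o' }
        @{ Upper = 'Ü'; Lower = 'ü'; Base = 'u' }
    )
    foreach ($p in $pairs) {
        $u = $p.Upper
        $result = $result `
            -creplace "$u(?=\p{Lu})|(?<=\p{Lu})$u", ($p.Base.ToUpper() + 'E') `
            -creplace $u, ($p.Base.ToUpper() + 'e') `
            -creplace $p.Lower, ($p.Base + 'e')
    }
    $result
}

Convert-UmlautStatic 'Jürgen MÜLLER'   # Juergen MUELLER
Convert-UmlautStatic 'SCHNEEWEIß'      # SCHNEEWEISS
```

The order matters here: the all-caps patterns run before the plain ones, so the more specific rule gets the first chance to match.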

The approach I would try would be using a Regex Replacement evaluator. These are fairly easy to do in PowerShell, since you can cast a [ScriptBlock] to any delegate.
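A tiny, self-contained illustration of that cast, using made-up input and not yet related to umlauts: the script block is converted to a [System.Text.RegularExpressions.MatchEvaluator] delegate and receives each match, including its Value and Index.

```powershell
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($match)
    # Wrap each match in brackets and report the position where it was found.
    '[{0}@{1}]' -f $match.Value, $match.Index
}

$demo = [regex]::Replace('a-b', '[ab]', $evaluator)
$demo   # [a@0]-[b@2]
```

Because the evaluator sees the match position, it can look at the characters around the match in the original string.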

I believe this script will do the trick:

$inputString = @'
ÜNLÜ
Ünlü
SCHNEEWEIß
Schneeweiß
Geßl
GEßL
Josef Öbinger
Jürgen MÜLLER
'@

$umlautPattern = [regex]::new('[ÄÖÜäöüß]')
$umlautPattern.Replace($inputString, {
    param($match)
    # True if the matched character itself is uppercase (never the case for ß).
    $wasCapitalized = $match.Value -cmatch '\p{Lu}'
    
    # Neighboring characters; a space stands in at either end of the input.
    $lastCharacter = 
        if ($match.Index -gt 0) {
            $inputString[$match.Index - 1]
        } else { ' ' }

    $nextCharacter = 
        if ($match.Index -lt ($inputString.Length - 1)) {
            $inputString[$match.Index + 1]
        } else { ' ' }

    # All-caps only if both neighbors are whitespace or uppercase letters.
    $shouldCapitalizeAll = 
        $lastCharacter -cmatch '[\s\p{Lu}]' -and
        $nextCharacter -cmatch '[\s\p{Lu}]'
    
    # switch compares case-insensitively, so Ä, Ö, Ü hit the same branches.
    $replacement = 
        switch ($match.Value) {
            "ä" {"ae"}
            "ö" {"oe"}
            "ü" {"ue"}
            "ß" {"ss"}
        }

    if ($shouldCapitalizeAll) {
        $replacement.ToUpper()
    } elseif ($wasCapitalized) {
        $replacement.Substring(0,1).ToUpper() + $replacement.Substring(1)
    } else {
        $replacement
    }
})

As the other answer also demonstrates, an evaluator is helpful because it makes it easy to do a replacement that depends on the surrounding context of the match.

Running the code above produces this output, which lines up with your desired results:

UENLUE
Uenlue
SCHNEEWEISS
Schneeweiss
Gessl
GESSL
Josef Oebinger
Juergen MUELLER

One additional note: I used both the preceding and the following character to decide whether the replacement should be fully capitalized.