Switching multi-language substring positions using regex

Time:11-02

Raw input with lithuanian letters:

Ą.BČ
Ą.BČ D Ę
Ą. BČ
Ą. BČ D Ę
Ą BČ
Ą BČ D Ę
Examples below should not be affected.
ĄB ČD DĘ

Expected result:

BČ Ą.
BČ Ą. D Ę
BČ Ą. 
BČ Ą. D Ę
BČ Ą 
BČ Ą D Ę
ĄB ČD DĘ

What I've tried:

^(.\.? *)([\p{L}\p{N}\p{M}]*)$
With ReplaceAllString substitution like so
$2 $1

I have tried various patterns, but this is the best I could come up with for now. It matches the 1st, 3rd and 5th lines and substitutes them correctly, apart from some extra spaces at the end of the lines:

BČ Ą.
Ą.BČ D Ę
BČ Ą. 
Ą. BČ D Ę
BČ Ą 
Ą BČ D Ę
ĄB ČD DĘ

Explanation:

The data set contains varying entries built on the basic structure [FIRST NAME FIRST LETTER][LASTNAME], which I would ideally like to bring to the form [LASTNAME][SPACE][FIRST NAME FIRST LETTER][DOT]?


Final solution:

^([\p{L}\p{N}\p{M}](?:\. *| +))([\p{L}\p{N}\p{M}]+)
With ReplaceAllString substitution like so
$2 $1
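
If the extra spaces at the end of the lines are unwanted, one possible variant (not part of the original post) is to keep the separating spaces out of every capture group and give the optional dot a group of its own. The names `trimRe` and `swapTrimmed` below are hypothetical, and the `(?m)` flag is an assumption that the whole multi-line input is passed as one string. A minimal sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// trimRe is a hypothetical variant of the final pattern:
//   group 1: the single leading letter (the first-name initial)
//   group 2: the optional dot (stays empty when only spaces follow the initial)
//   group 3: the last name
// The separating spaces are matched but not captured, so they are dropped.
// (?m) lets ^ match at the start of every line of a multi-line input.
var trimRe = regexp.MustCompile(`(?m)^([\p{L}\p{N}\p{M}])(?:(\.) *| +)([\p{L}\p{N}\p{M}]+)`)

// swapTrimmed moves the last name in front of the initial without
// carrying the separating spaces along.
func swapTrimmed(s string) string {
	return trimRe.ReplaceAllString(s, "${3} ${1}${2}")
}

func main() {
	for _, line := range []string{"Ą.BČ", "Ą. BČ", "Ą BČ D Ę", "ĄB ČD DĘ"} {
		// %q makes any leading or trailing space visible in the output.
		fmt.Printf("%q -> %q\n", line, swapTrimmed(line))
	}
}
```

In Go's regexp package a group that did not take part in the match expands to an empty string, so `${2}` simply disappears for inputs without a dot.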

CodePudding user response:

For your example data, you can omit the anchor $ and match either a dot followed by optional spaces, or 1 or more spaces.

To prevent an empty match for the character class, you can repeat it 1 or more times using + instead of *

^(.(?:\. *| +))([\p{L}\p{N}\p{M}]+)


Note that the . can match any character, including a space. You might also change the dot to a single [\p{L}\p{N}\p{M}].
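
To see the substitution in context, here is a minimal Go sketch of the pattern with the leading dot replaced by the Unicode class, as suggested above. The names `swapRe` and `swapInitial` are made up for illustration, and the `(?m)` flag is an assumption that the input arrives as one multi-line string rather than line by line.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// swapRe: one letter/number/mark, then either a dot with optional spaces
// or one or more spaces (group 1), followed by a run of letters (group 2).
// (?m) makes ^ match at the start of every line, not only at the start of the text.
var swapRe = regexp.MustCompile(`(?m)^([\p{L}\p{N}\p{M}](?:\. *| +))([\p{L}\p{N}\p{M}]+)`)

// swapInitial puts the last name first, followed by a space and the initial.
func swapInitial(s string) string {
	return swapRe.ReplaceAllString(s, "$2 $1")
}

func main() {
	input := strings.Join([]string{
		"Ą.BČ",
		"Ą.BČ D Ę",
		"Ą. BČ",
		"Ą BČ D Ę",
		"ĄB ČD DĘ", // no dot or space after the first letter, so it is left alone
	}, "\n")

	for _, line := range strings.Split(swapInitial(input), "\n") {
		// %q makes the spaces carried along in group 1 visible.
		fmt.Printf("%q\n", line)
	}
}
```

Because the spaces after the initial are part of group 1, they travel with it in the replacement, which is where the extra spaces mentioned in the question come from.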
