Add a space after a word if it's at the beginning of a string or if it's after one or more

import re

line = "treinta y un"       #example 1
line = "veinti un "         #example 2
line = "un"                 #example 3
line = "un "                #example 4
line = "uno"                #example 5
line = "treinta yun"        #example 6
line = "treinta y unghhg"   #example 7

re_for_identificate_1 = "(?<!^)un"
re_for_identificate_2 = " un"

line = re.sub(re_for_identificate_2, " un ", line)
line = re.sub(re_for_identificate_1, "un ", line)

print(repr(line))

How can I obtain these outputs from those inputs?

"treinta y un "       #for example 1
"veinti un "          #for example 2
"un "                 #for example 3
"un "                 #for example 4
"uno"                 #for example 5
"treinta yun"         #for example 6
"treinta y unghhg"    #for example 7

Note that for examples 2, 4, 5, 6 and 7 the regex should not make any changes: in examples 2 and 4 there is already a space after the word; in "uno" and "treinta y unghhg" the substring "un" is not at the end of the string; and in "treinta yun" the substring "un" is not preceded by one or more spaces.

CodePudding user response：

If you want to use regex, you can use \bun$, which checks that the last whole word in the string is un, and that there is nothing after it in the string. If that is the case, a space is added to the end of the string:

import re

lines = ["treinta y un", "veinti un ", "un", "un ",
         "uno", "treinta yun", "treinta y unghhg"]

result = [re.sub(r'\bun$', 'un ', line) for line in lines]
print(result)

Output:

[
 'treinta y un ',
 'veinti un ',
 'un ',
 'un ',
 'uno',
 'treinta yun',
 'treinta y unghhg'
]

CodePudding user response：

I'm not sure you need regular expressions. The following code appears to achieve what you want.

Three checks are performed:

The content is a string
The last two characters are "un"
The last word is "un"

Here I've wrapped the logic in a list comprehension to demonstrate.

lines = ["treinta y un", "veinti un ", "un", "un ",
         "uno", "treinta yun", "treinta y unghhg"]

# Append a space only when all three checks pass; otherwise keep the line as-is.
result = [line + " " if (isinstance(line, str)
                         and line[-2:] == "un"
                         and line.split()[-1] == "un")
          else line
          for line in lines]

for line in result:
    print(f"'{line}'")

Output:

'treinta y un '
'veinti un '
'un '
'un '
'uno'
'treinta yun'
'treinta y unghhg'
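
The same three checks can also be written as a small named function, which may be easier to read and test than an inline conditional. This is only a sketch: the name add_space_after_un is made up for illustration and is not part of the answer above.

```python
def add_space_after_un(line):
    # Hypothetical helper name, not from the original answer.
    # Check 1: anything that is not a string is returned untouched.
    if not isinstance(line, str):
        return line
    # Check 2: the string literally ends in "un" (no trailing space).
    # Check 3: the last whitespace-separated word is exactly "un".
    if line.endswith("un") and line.split()[-1] == "un":
        return line + " "
    return line


lines = ["treinta y un", "veinti un ", "un", "un ",
         "uno", "treinta yun", "treinta y unghhg"]

result = [add_space_after_un(line) for line in lines]
print(result)
```

Because the endswith check runs first, line.split()[-1] is never evaluated on an empty string.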

CodePudding user response：

If you assign to line repeatedly like that in your code, each assignment overwrites the previous one, so only the last example is actually tested.

The pattern (?<!^)un only asserts that "un" is not at the very start of the string; it puts no condition on what follows "un".
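
To see how that pattern misfires, it can help to run it on a few of the inputs from the question. The sketch below uses only re.sub with the question's own pattern; the variable names are illustrative.

```python
import re

pattern = r"(?<!^)un"  # "un" anywhere except at the very start of the string

# "un" at the start of the string is skipped, so no space is added.
at_start = re.sub(pattern, "un ", "un")

# "un" inside a longer word still matches, so the word gets split.
inside_word = re.sub(pattern, "un ", "treinta y unghhg")

# "un" that is already followed by a space gets a second space.
already_spaced = re.sub(pattern, "un ", "veinti un ")

print(repr(at_start), repr(inside_word), repr(already_spaced))
```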

If you also want to exclude a match for #un, you can use (?<!\S) instead, which asserts a whitespace boundary to the left.
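
The difference from the word boundary \b used in the earlier answer shows up on an input such as "#un". A minimal comparison, with illustrative variable names:

```python
import re

# \b only needs a transition between a word and a non-word character.
# "#" is a non-word character, so "un" in "#un" counts as a whole word.
with_word_boundary = re.sub(r"\bun$", "un ", "#un")

# (?<!\S) requires that the character to the left is not a non-whitespace
# character, i.e. it is whitespace or the start of the string.
with_whitespace_boundary = re.sub(r"(?<!\S)un$", "un ", "#un")

print(repr(with_word_boundary), repr(with_whitespace_boundary))
```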

To make sure the pattern is at the end of the string, you can use the anchor $.

The code example uses single-line strings, but if you want to do the replacement in a string that contains multiple lines, you have to pass the re.MULTILINE flag to re.sub.
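
As a rough illustration of the multiline case, the sketch below applies the same pattern to one string that holds several lines; the sample text is made up from the question's examples.

```python
import re

pattern = r"(?<!\S)un$"

# One string holding several lines, instead of a list of separate strings.
text = "treinta y un\nuno\nun"

# Without the flag, $ only matches at the end of the whole string
# (or just before a final newline), so only the last line changes.
without_flag = re.sub(pattern, "un ", text)

# With re.MULTILINE, $ also matches right before each newline.
with_flag = re.sub(pattern, "un ", text, flags=re.MULTILINE)

print(repr(without_flag))
print(repr(with_flag))
```

Note that flags is passed by keyword: the fourth positional argument of re.sub is count, not flags.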

Example

import re

pattern = r"(?<!\S)un$"

lines = ["treinta y un", "veinti un ", "un", "un ",
         "uno", "treinta yun", "treinta y unghhg", "#un"]

print([re.sub(pattern, 'un ', line) for line in lines])

Output

[
  'treinta y un ',
  'veinti un ',
  'un ',
  'un ',
  'uno',
  'treinta yun',
  'treinta y unghhg',
  '#un'
]