How do I get a list of compound names with special characters in Python?

How can I use regex in Python to generate a list from place names that are compound (two-word) names?

names = ['Nizhniy Novgorod', 'Cần Thơ', 'Ba Beja', 'Bandar Bampung', 'Benin City', 'Ciudad Nezahualcóyotl', 'Biên Hòa', 'São Gonçalo', 'São Luís', 'New Orleans', 'Thủ Đức']

I was trying to do this but it returns all names:

import re

lst = []
for word in names:
    if re.findall(r'[A-Z]\w \b', word[0]) == re.findall(r'\b[A-Z]\w ', word[1]):
        lst.append(word)

print(lst)

Output:
['Nizhniy Novgorod', 'Cần Thơ', 'Ba Beja', 'Bandar Bampung', 'Benin City', 'Ciudad Nezahualcóyotl', 'Biên Hòa', 'São Gonçalo', 'São Luís', 'New Orleans', 'Thủ Đức']

The desired output would be ['Ba Beja', 'Bandar Bampung'].

It is an exercise, which is why I can only use the re module. Any help will be appreciated.

CodePudding user response:

OK, so I have two answers for you: one that uses regex and one that doesn't.

Here is the regex version:

import re

names = ['Nizhniy Novgorod', 'Cần Thơ', 'Ba Beja', 'Bandar Bampung', 'Benin City', 'Ciudad Nezahualcóyotl', 'Biên Hòa', 'São Gonçalo', 'São Luís', 'New Orleans', 'Thủ Đức']

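# \w matches Unicode letters in Python 3 (plus digits and _).
# A range like [A-z] would also match the punctuation between 'Z' and 'a'.
# (\w) captures the first letter; \1 requires the same letter after the space.
pattern = re.compile(r'^(\w)\w*\s\1\w*$')

lst = []
for line in names:
    if pattern.search(line):
        lst.append(line)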

print(lst)

OUTPUT:

['Nizhniy Novgorod', 'Ba Beja', 'Bandar Bampung']
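To see what the backreference is doing, here is a small sketch. It uses `\w` for the letters, which in Python 3 also covers accented characters (and, strictly speaking, digits and underscores). The second half is only a guess: the question never states the rule, but the desired output lists only names whose words both start with B, so `b_only` shows that stricter filter as an assumption.

```python
import re

names = ['Nizhniy Novgorod', 'Cần Thơ', 'Ba Beja', 'Bandar Bampung', 'Benin City',
         'Ciudad Nezahualcóyotl', 'Biên Hòa', 'São Gonçalo', 'São Luís',
         'New Orleans', 'Thủ Đức']

# Group 1 captures the first letter of the first word;
# \1 then demands that exact letter right after the whitespace.
pattern = re.compile(r'^(\w)\w*\s\1\w*$')

m = pattern.search('Bandar Bampung')
print(m.group(1))                    # B
print(pattern.search('Benin City'))  # None ('B' is not followed by 'B' after the space)

# Assumed stricter rule (not stated in the question): both words start with 'B'.
b_only = [n for n in names if re.search(r'^B\w*\sB\w*$', n)]
print(b_only)                        # ['Ba Beja', 'Bandar Bampung']
```

If the exercise really means "both words start with the same letter", then 'Nizhniy Novgorod' belongs in the result as well, which is what the general pattern returns.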

And here is the version that does not use regex:

names = ['Nizhniy Novgorod', 'Cần Thơ', 'Ba Beja', 'Bandar Bampung', 'Benin City', 'Ciudad Nezahualcóyotl', 'Biên Hòa', 'São Gonçalo', 'São Luís', 'New Orleans', 'Thủ Đức']

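lst = []
for line in names:
    words = line.split()  # split on any whitespace
    # Keep only two-word names whose words start with the same letter.
    # Checking len() avoids the unpacking error a three-word name would cause.
    if len(words) == 2 and words[0][0] == words[1][0]:
        lst.append(line)
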
print(lst)

OUTPUT:

['Nizhniy Novgorod', 'Ba Beja', 'Bandar Bampung']
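As for why the attempt in the question returned every name: `word[0]` and `word[1]` are the first two characters of the string, not its two words. Neither pattern can match a single character, so both `findall` calls return an empty list and the comparison is always true. A quick check, using 'Benin City' as the sample:

```python
import re

word = 'Benin City'

# Indexing a string gives single characters, not words.
print(word[0], word[1])  # B e

# Each pattern needs at least three characters (letter, word char, space),
# so it can never match a one-character string.
left = re.findall(r'[A-Z]\w \b', word[0])
right = re.findall(r'\b[A-Z]\w ', word[1])

print(left, right)    # [] []
print(left == right)  # True, so the name is appended regardless
```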