I am working in Python and have a list of country names that I would like to clean. Most names are already written the way I want them. However, some have a one- or two-digit number attached, and others have text in parentheses appended. Here is a sample of the list:
Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8
The part that I want to capture would look like this:
Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia
The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, this leaves country names that are followed by text in parentheses with a trailing whitespace.
In other words, I would like to match the entire country name unless it is followed by a digit or by a whitespace and an opening parenthesis. In that case the match should stop just before the digit or before the whitespace that precedes the (.
I know that I could probably solve this in two steps, but I am reasonably sure that it should be possible to define a pattern that does it in one. Since I am getting familiar with regex anyway, I thought this would be a nice thing to know.
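For reference, a minimal sketch of the behaviour described above and of the two-step workaround. It assumes the character-class attempt ends in a + quantifier so that it matches a run of characters; the variable names are only illustrative.

```python
import re

# The character-class attempt from the question, assumed to end in "+"
# so that it matches a whole run of allowed characters.
attempt = re.compile(r"^[a-zA-Z\s,ô'ç-]+")

samples = ["Australia1", "Bolivia (Plurinational State of)", "Côte d'Ivoire"]

# Step 1 alone: \s is in the class, so the space before "(" is captured too.
raw = [attempt.match(s).group() for s in samples]
print(raw)      # ['Australia', 'Bolivia ', "Côte d'Ivoire"]

# Step 2: strip the trailing whitespace afterwards.
cleaned = [name.rstrip() for name in raw]
print(cleaned)  # ['Australia', 'Bolivia', "Côte d'Ivoire"]
```

The character class also has to list every accented letter that can occur, which is one reason the answers match "anything up to a stop condition" instead.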
CodePudding user response:
You can test the regex here: https://regex101.com/r/dupn18/1
This should do the trick:
In [1]: import re
In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')
In [3]: data = """Argentina
...: Australia1
...: Bolivia (Plurinational State of)
...: China, Hong Kong Special Administrative Region
...: Côte d'Ivoire
...: Curaçao
...: Guinea-Bissau
...: Indonesia8""".splitlines()
In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
'Australia',
'Bolivia',
'China, Hong Kong Special Administrative Region',
"Côte d'Ivoire",
'Curaçao',
'Guinea-Bissau',
'Indonesia']
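One caveat, since the question mentions two-digit numbers: reading the pattern above with a greedy .+, the match backtracks only as far as the last position that is followed by a digit, so a two-digit suffix would lose just its final digit. The sketch below compares that reading with a lazy variant that stops at the first digit or " (". The name Australia12 is a made-up input, and the lazy pattern is a suggestion that assumes digits appear only as trailing footnote markers.

```python
import re

# Greedy reading of the answer's pattern.
greedy = re.compile(r"(.+(?=\d| \()|.+)")

# Lazy variant (an assumption, not from the answer): expand one character
# at a time and stop before the first digit, before " (", or at the end.
lazy = re.compile(r"^.+?(?=\d| \(|$)")

# "Australia12" is a hypothetical two-digit case.
names = ["Argentina", "Australia12", "Bolivia (Plurinational State of)"]

greedy_result = [greedy.search(n).group() for n in names]
lazy_result = [lazy.search(n).group() for n in names]

print(greedy_result)  # ['Argentina', 'Australia1', 'Bolivia']
print(lazy_result)    # ['Argentina', 'Australia', 'Bolivia']
```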
CodePudding user response:
I think you can try ^([^\d \n]| +[^\d (\n])+
or, if you can guarantee your input doesn't contain double spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+
(The ^ character inside [] excludes the characters that follow it; see https://regexone.com/lesson/excluding_characters)
Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
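A short sketch of how the first pattern could be applied in Python. It assumes the group is repeated with a trailing + and that the space alternative is written as " +", so that runs of spaces are handled; the input with trailing spaces is a made-up example.

```python
import re

# Either a character that is not a digit, space, or newline, or one or more
# spaces followed by a character that is not a digit, space, "(" or newline.
pattern = re.compile(r"^([^\d \n]| +[^\d (\n])+")

countries = [
    "Indonesia8",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Guinea-Bissau  ",  # hypothetical input with trailing spaces
]

cleaned = [pattern.match(c).group() for c in countries]
print(cleaned)
# ['Indonesia', 'Bolivia', 'China, Hong Kong Special Administrative Region', 'Guinea-Bissau']
```

Because no letters are listed explicitly, accented names such as Côte d'Ivoire and Curaçao need no special handling. An empty string would not match, so pattern.match would return None in that case.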