Regex - Match a string up to a digit or a specific string

Time:12-26

I am working in Python and have a list of country names that I would like to clean. Most names are already written the way I want them. However, some have a one- or two-digit number attached, and others have text in parentheses appended. Here's a sample of that list:

Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8

The part that I want to capture would look like this:

Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia

The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+ . However, for country names that are followed by text in parentheses, this leaves a trailing whitespace in the match.
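To make the trailing-whitespace problem concrete, here is a minimal sketch. It assumes a `+` quantifier follows the character class; the variable names are only illustrative.

```python
import re

# The character-class pattern from the question, assuming a `+` quantifier after the class.
pattern = re.compile(r"^[a-zA-Z\s,ô'ç-]+")

# A trailing digit is not in the class, so the match stops cleanly before it.
digit_case = pattern.match("Australia1").group()

# The space before "(" IS in the class (via \s), so it ends up in the match.
paren_case = pattern.match("Bolivia (Plurinational State of)").group()

print(repr(digit_case))
print(repr(paren_case))
```

The second result keeps the space that precedes the opening parenthesis, which is the behaviour described above.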

This means I would like to match the entire country name unless there is a digit or a whitespace followed by an opening parenthesis, in which case the match should stop before the digit or before the whitespace and the (.

I know that I could probably solve this in two steps, but I am reasonably sure that it should be possible to define a pattern that does it in one step. Since I am in the process of getting familiar with regex anyway, I thought this would be a nice thing to know.
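For comparison, the two-step approach mentioned here might look like the following sketch. The function name `clean_two_steps` is hypothetical, and the patterns assume the unwanted parts always sit at the end of the string.

```python
import re

def clean_two_steps(name: str) -> str:
    """Strip a trailing number, then a trailing parenthesised remark."""
    # Step 1: remove digits at the very end of the name.
    name = re.sub(r"\d+$", "", name)
    # Step 2: remove "(...)" at the end, including the whitespace before it.
    name = re.sub(r"\s*\(.*\)\s*$", "", name)
    return name

samples = [
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
]
two_step_result = [clean_two_steps(s) for s in samples]
print(two_step_result)
```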

CodePudding user response:

You can test the regex here: https://regex101.com/r/dupn18/1
This should do the trick:

In [1]: import re

In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')

In [3]: data = """Argentina
   ...: Australia1
   ...: Bolivia (Plurinational State of)
   ...: China, Hong Kong Special Administrative Region
   ...: Côte d'Ivoire
   ...: Curaçao
   ...: Guinea-Bissau
   ...: Indonesia8""".splitlines()

In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
 'Australia',
 'Bolivia',
 'China, Hong Kong Special Administrative Region',
 "Côte d'Ivoire",
 'Curaçao',
 'Guinea-Bissau',
 'Indonesia']
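One caveat worth checking: a greedy `.+` in front of a lookahead backtracks only as far as the last position where the lookahead succeeds, so a two-digit suffix would lose just its final digit. The sketch below is a lazy variant, not the pattern from this answer; it stops at the first digit, at the first " (", or at the end of the string. The sample name "Australia12" is made up for illustration.

```python
import re

# Greedy form, shown only for comparison.
greedy = re.compile(r"(.+(?=\d| \()|.+)")

# Lazy variant: expand one character at a time until a digit,
# a space followed by "(", or the end of the string comes next.
lazy = re.compile(r"^.+?(?=\d| \(|$)")

samples = [
    "Argentina",
    "Australia12",  # hypothetical two-digit suffix
    "Bolivia (Plurinational State of)",
    "Indonesia8",
]

cleaned = [lazy.search(s).group() for s in samples]
greedy_two_digits = greedy.search("Australia12").group()

print(cleaned)
print(greedy_two_digits)
```

With the lazy quantifier, the single pattern covers both one- and two-digit suffixes, and the alternation is no longer needed because `$` handles names that have nothing to strip.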

CodePudding user response:

I think you can try ^([^\d \n]| [^\d (\n])+ or, if you can guarantee your input doesn't contain double spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+ . (The ^ character inside [] excludes the characters that follow it, see https://regexone.com/lesson/excluding_characters)

Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
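As a quick check of this approach, here is a minimal sketch that runs the first pattern over the sample list, assuming the group is repeated with `+`. Each repetition consumes either one character that is not a digit, space, or newline, or a space followed by a character that is not a digit, space, "(", or newline. The input "Peru " is a made-up example for the trailing-space case.

```python
import re

# Repeated group: a non-digit/non-space character, or a space that is
# followed by something other than a digit, space, "(" or newline.
pattern = re.compile(r"^([^\d \n]| [^\d (\n])+")

countries = [
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
]

# group() without arguments returns the whole match, not just the last repetition.
matched = [pattern.match(c).group() for c in countries]
trailing_space_case = pattern.match("Peru ").group()  # hypothetical input

print(matched)
print(repr(trailing_space_case))
```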
