I have the following type of strings: "CanadaUnited States", "GermanyEnglandSpain"
I want to split them into the countries' names, i.e.:
['Canada', 'United States'] ['Germany', 'England', 'Spain']
I have tried using the following regex:
text = "GermanyEnglandSpain"
re.split('[a-z](?=[A-Z])', text)
and I'm getting:
['German', 'Englan', 'Spain']
How can I avoid losing the last character of every word? Thanks!
CodePudding user response:
I would use re.findall here, matching each country name instead of splitting:
import re
inp = "CanadaUnited States"
countries = re.findall(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*', inp)
print(countries) # ['Canada', 'United States']
The regex pattern used here says to match:
[A-Z][a-z]+
match a leading capitalized word of a country name
(?: [A-Z][a-z]+)*
followed by a space and another capitalized word, zero or more times
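To make the pieces of a find-all pattern easier to read, here is a minimal sketch that writes it with re.VERBOSE, so each part carries its own comment. It assumes every word in a country name is one ASCII capital letter followed by one or more ASCII lowercase letters; the names COUNTRY and find_countries are made up for illustration.

```python
import re

# In verbose mode plain whitespace in the pattern is ignored,
# so the literal space between words is written as [ ].
COUNTRY = re.compile(r"""
    [A-Z][a-z]+              # a leading capitalized word
    (?: [ ] [A-Z][a-z]+ )*   # a space and another capitalized word, zero or more times
""", re.VERBOSE)

def find_countries(text):  # hypothetical helper name
    """Return every country-like match found in text."""
    return COUNTRY.findall(text)

print(find_countries("CanadaUnited States"))   # ['Canada', 'United States']
print(find_countries("GermanyEnglandSpain"))   # ['Germany', 'England', 'Spain']
```

Names that contain lowercase-initial words, hyphens, or accented letters would need a wider character class.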
CodePudding user response:
You can use re.split with a capture group like so, but then you will also need to filter out the empty strings it leaves between the matches:
import re
text = "GermanyEnglandSpain"
res = re.split('([A-Z][a-z]*)', text)
res = list(filter(None, res))
print(res) # ['Germany', 'England', 'Spain']
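One thing to watch with the capture-group split is multi-word names such as "United States". The sketch below shows what happens there, and then a possible alternative: splitting on a zero-width position between a lowercase and an uppercase letter, which consumes no characters and so does not drop the last letter the way the attempt in the question does. It assumes Python 3.7 or later, where re.split accepts patterns that match an empty string; the variable names are illustrative.

```python
import re

text = "CanadaUnited States"

# The capture-group split treats every capitalized word as its own piece,
# so the space inside "United States" comes back as a separate element.
by_word = list(filter(None, re.split(r'([A-Z][a-z]*)', text)))
print(by_word)  # ['Canada', 'United', ' ', 'States']

# Zero-width split: break only where a lowercase letter is directly
# followed by an uppercase letter. Nothing is consumed, so no letters are lost.
by_boundary = re.split(r'(?<=[a-z])(?=[A-Z])', text)
print(by_boundary)  # ['Canada', 'United States']
```

The lookbehind keeps the space in "United States" intact because the "S" there is preceded by a space, not by a lowercase letter.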
CodePudding user response:
My answer is longer than Tim's because I wanted to spell out each step, so that you can adapt it to other cases as you need. You can shorten it by using lambda functions and combining several regexes into one.
Basic flow: add a space before every uppercase letter, replace runs of multiple spaces with *, split on single spaces, and replace each * with a single space.
import re
text = "GermanyUnited StatesEnglandUnited StatesSpain"
text2 = re.sub(r'([A-Z])', r' \1', text)  # add a single space before every uppercase letter
print(text2)
# ' Germany United  States England United  States Spain' (leading space; two spaces inside each 'United  States')
text3 = re.sub(r'\s{2,}', '*', text2)  # replace runs of 2 or more spaces with * so they can be restored later
print(text3)
# ' Germany United*States England United*States Spain'
text4 = re.split(' ', text3)  # split the text into a list on every single space
print(text4)
# ['', 'Germany', 'United*States', 'England', 'United*States', 'Spain']
text5 = []
for i in text4:
    text5.append(re.sub(r'\*', ' ', i))  # replace every * with a single space
text5 = list(filter(None, text5))  # remove empty elements
print(text5)
# ['Germany', 'United States', 'England', 'United States', 'Spain']
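The same flow can be condensed into one function. This is only a sketch: it assumes the placeholder character (here *) never appears in the input, and the names split_countries and placeholder are made up for illustration.

```python
import re

def split_countries(text, placeholder="*"):
    """Split concatenated capitalized names, keeping in-name spaces."""
    # Step 1: put a space before every uppercase letter.
    spaced = re.sub(r"([A-Z])", r" \1", text)
    # Step 2: runs of 2+ spaces mark spaces that were already in a name;
    # protect them with the placeholder.
    protected = re.sub(r"\s{2,}", placeholder, spaced)
    # Step 3: split on the remaining single spaces, drop empty pieces,
    # and turn each placeholder back into a space.
    return [part.replace(placeholder, " ")
            for part in protected.split(" ") if part]

print(split_countries("GermanyUnited StatesEnglandUnited StatesSpain"))
# ['Germany', 'United States', 'England', 'United States', 'Spain']
```

If the input could contain a literal *, pass a different placeholder that cannot occur in the data.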