Splitting a text by a capital letter after a small letter, without losing the small letter

Time:11-28

I have the following type of strings: "CanadaUnited States", "GermanyEnglandSpain"

I want to split them into the countries' names, i.e.:

['Canada', 'United States'] ['Germany', 'England', 'Spain']

I have tried using the following regex:

import re

text = "GermanyEnglandSpain"
re.split('[a-z](?=[A-Z])', text)

and I'm getting: ['German', 'Englan', 'Spain']

How can I avoid losing the last character of every word? Thanks!

CodePudding user response:

I would use re.findall here, matching the country names directly instead of splitting:

import re

inp = "CanadaUnited States"
countries = re.findall(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*', inp)
print(countries)  # ['Canada', 'United States']

The regex pattern used here says to match:

  • [A-Z][a-z]+ the leading capitalized word of a country name
  • (?: [A-Z][a-z]+)* followed by a space and another capitalized word, zero or more times
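
To check the pattern against both inputs from the question, here is a minimal sketch. The helper name split_countries is hypothetical, and the pattern assumes every word is one uppercase ASCII letter followed by at least one lowercase ASCII letter, so names with hyphens, accents, or lowercase words such as "of" would need a wider pattern.

```python
import re

# Hypothetical helper, not part of the original answer.
# One capitalized word, optionally followed by more capitalized
# words that are each separated by a single space.
COUNTRY = re.compile(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*')

def split_countries(text):
    # findall returns every non-overlapping match, left to right
    return COUNTRY.findall(text)

print(split_countries("CanadaUnited States"))   # ['Canada', 'United States']
print(split_countries("GermanyEnglandSpain"))   # ['Germany', 'England', 'Spain']
```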

CodePudding user response:

You can use re.split with a capture group, so that the matched words are kept in the result, but you then also need to filter out the empty strings left between the matches:

import re

text = "GermanyEnglandSpain"
res = re.split('([A-Z][a-z]*)', text)
res = list(filter(None, res))
print(res)  # ['Germany', 'England', 'Spain']
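
Note that the capture-group split also breaks at a capital that follows a space, so "CanadaUnited States" would come out as 'Canada', 'United', ' ', 'States'. One possible alternative, sketched here assuming Python 3.7 or later (where re.split accepts a pattern that can match an empty string), is to split at the zero-width position between a lowercase and an uppercase letter. Nothing is consumed, so no character is lost. The function name split_on_case_boundary is hypothetical.

```python
import re

def split_on_case_boundary(text):
    # (?<=[a-z]) requires a lowercase letter just before the split point,
    # (?=[A-Z]) requires an uppercase letter just after it.
    # Both are zero-width, so the split removes no characters.
    return re.split(r'(?<=[a-z])(?=[A-Z])', text)

print(split_on_case_boundary("GermanyEnglandSpain"))  # ['Germany', 'England', 'Spain']
print(split_on_case_boundary("CanadaUnited States"))  # ['Canada', 'United States']
```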

CodePudding user response:

My answer is longer than Tim's because I wanted to cover more cases, so that you can adapt it as needed. You can shorten it by using lambda functions and combining several regexes into one.

Basic flow: add a space before every uppercase letter, replace each run of two or more spaces with *, split on single spaces, then replace * with a single space.

import re

text = "GermanyUnited  StatesEnglandUnited StatesSpain"
text2 = re.sub(r'([A-Z])', r' \1', text)  # add a single space before every uppercase letter
print(text2)
# " Germany United   States England United  States Spain" (note the leading space)
text3 = re.sub(r'\s{2,}', '*', text2)  # replace 2 or more spaces with * so we can restore them later
print(text3)
# " Germany United*States England United*States Spain"
text4 = re.split(' ', text3)  # split the text into a list on every single space
print(text4)
# ['', 'Germany', 'United*States', 'England', 'United*States', 'Spain']
text5 = []

for i in text4:
    text5.append(re.sub(r'\*', ' ', i))  # replace every * with a single space
text5 = list(filter(None, text5))  # remove empty elements

print(text5)
# ['Germany', 'United States', 'England', 'United States', 'Spain']
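
The answer mentions that the flow can be shortened. Here is a minimal sketch of one compact form, assuming the placeholder character (* by default) never occurs in the input. The function name split_with_placeholder and its placeholder parameter are hypothetical.

```python
import re

def split_with_placeholder(text, placeholder='*'):
    # Assumption: `placeholder` does not appear anywhere in `text`.
    # 1. Add a space before every uppercase letter.
    spaced = re.sub(r'([A-Z])', r' \1', text)
    # 2. Runs of 2+ spaces mark spaces that were in the original text.
    protected = re.sub(r'\s{2,}', placeholder, spaced)
    # 3. str.split() with no argument splits on whitespace and drops
    #    empty strings, so no separate filtering step is needed.
    return [part.replace(placeholder, ' ') for part in protected.split()]

print(split_with_placeholder("GermanyUnited  StatesEnglandUnited StatesSpain"))
# ['Germany', 'United States', 'England', 'United States', 'Spain']
```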