I have strings that look like this:
FF0MASTSDRatingUnitState20211103105132-12991140
AA000MASTERNewWord10111103102344-22991111
I want to extract the words between the leading capital-letter sequences and the numbers, for example RatingUnitState and NewWord. However, it's not that straightforward, since these words also begin with a capital letter.
Is it possible to first discard the initial capital-letter sequence, maybe by checking whether the following letter is a capital as well? If it is, the current letter is part of the capital-letter sequence. If it is not, the current letter is the start of the camel-case sequence that I want to extract.
How could I translate this idea into code?
CodePudding user response:
You might use the regular expression feature called negative lookahead for this task, as follows:
import re
text1 = "FF0MASTSDRatingUnitState20211103105132-12991140"
text2 = "AA000MASTERNewWord10111103102344-22991111"
match1 = re.search(r'[A-Z](?![A-Z])[A-Za-z]+', text1)
match2 = re.search(r'[A-Z](?![A-Z])[A-Za-z]+', text2)
print(match1.group(0)) # RatingUnitState
print(match2.group(0)) # NewWord
Disclaimer: I assume you deal only with ASCII letters. If you want to know more, read about the negative lookahead assertion in the Python re docs; for a more general discussion, search for zero-length assertions.
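To make the lookahead easier to reuse, the pattern can be wrapped in a small function that also copes with strings that contain no match (re.search returns None in that case). This is only a sketch: the names CAMEL_WORDS and extract_camel_words are made up for illustration, and it assumes ASCII letters as above.

```python
import re

# [A-Z]        one capital letter ...
# (?![A-Z])    ... that is NOT followed by another capital letter
# [A-Za-z]+    then the remaining letters, up to the first non-letter
CAMEL_WORDS = re.compile(r'[A-Z](?![A-Z])[A-Za-z]+')  # hypothetical name

def extract_camel_words(text):
    """Return the camel-case part of text, or None if there is none."""
    match = CAMEL_WORDS.search(text)
    return match.group(0) if match else None

print(extract_camel_words("FF0MASTSDRatingUnitState20211103105132-12991140"))  # RatingUnitState
print(extract_camel_words("AA000MASTERNewWord10111103102344-22991111"))  # NewWord
print(extract_camel_words("12345"))  # None
```

Returning None instead of calling group(0) directly avoids an AttributeError when a string has no camel-case part.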
CodePudding user response:
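Your logic could be something like this: find the first lowercase letter, step back one character to the capital that starts the words, and stop at the first digit after that.
text = "FF0MASTSDRatingUnitState20211103105132-12991140"
index0 = None        # start of the camel-case words
index1 = len(text)   # end of the camel-case words (exclusive)
# Use enumerate rather than text.index(ch): index() returns the FIRST
# occurrence of a character, which may be an earlier, unrelated position.
for i, ch in enumerate(text):
    if index0 is None and ch.islower():
        # The capital just before the first lowercase letter starts the words.
        index0 = max(i - 1, 0)
    elif index0 is not None and ch.isdigit():
        # The first digit after the words ends them.
        index1 = i
        break
if index0 is not None:
    print(text[index0:index1])  # RatingUnitState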
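The loop can also follow the idea from the question more literally: skip every capital letter whose next character is not a lowercase letter, then collect letters until the first non-letter. A rough sketch, where the function name camel_part is hypothetical and ASCII input is assumed:

```python
def camel_part(text):
    """Return the letters starting at the first capital followed by a lowercase letter."""
    for i in range(len(text) - 1):
        # A capital followed by a lowercase letter starts the camel-case part;
        # a capital followed by another capital or a digit is skipped.
        if text[i].isupper() and text[i + 1].islower():
            end = i + 1
            # Extend over the following letters; stop at the first digit or symbol.
            while end < len(text) and text[end].isalpha():
                end += 1
            return text[i:end]
    return None  # no capital letter followed by a lowercase letter was found

for sample in ("FF0MASTSDRatingUnitState20211103105132-12991140",
               "AA000MASTERNewWord10111103102344-22991111"):
    print(camel_part(sample))
```

For the two sample strings this prints RatingUnitState and NewWord.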