Extract letters after capital letter sequence

Time:03-01

I have strings that look like this:

FF0MASTSDRatingUnitState20211103105132-12991140
AA000MASTERNewWord10111103102344-22991111

I want to extract the words between the capital letter sequence and the numbers, for example RatingUnitState and NewWord. However, it's not that straightforward, since these words also begin with a capital letter.

Is it possible to first discard the initial capital letter sequence, maybe by checking whether the following letter is also a capital? If yes, the current letter is part of the capital letter sequence. If not, the current letter starts the camel case sequence that I want to extract.

How could I translate this idea into code?

CodePudding user response:

How could I translate this idea into code?

You could use a regular expression feature called negative lookahead for this task, as follows:

import re
text1 = "FF0MASTSDRatingUnitState20211103105132-12991140"
text2 = "AA000MASTERNewWord10111103102344-22991111"
# A capital letter NOT followed by another capital, then one or more letters
match1 = re.search(r'[A-Z](?![A-Z])[A-Za-z]+', text1)
match2 = re.search(r'[A-Z](?![A-Z])[A-Za-z]+', text2)
print(match1.group(0))  # RatingUnitState
print(match2.group(0))  # NewWord

Disclaimer: I assume you are dealing only with ASCII letters. To learn more about the negative lookahead assertion in Python, read the re docs; for a more general discussion, search for zero-length assertions.
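
If you need this in more than one place, here is a minimal sketch that wraps the same pattern in a function and returns None instead of raising when nothing matches. The names CAMEL_RUN and extract_camel_words are made up for illustration, and it still assumes ASCII letters only.

```python
import re

# Hypothetical names; the pattern itself is the negative-lookahead regex:
# a capital NOT followed by another capital, then one or more ASCII letters.
CAMEL_RUN = re.compile(r'[A-Z](?![A-Z])[A-Za-z]+')

def extract_camel_words(text):
    """Return the first camel case run in text, or None if there is none."""
    match = CAMEL_RUN.search(text)
    return match.group(0) if match else None

print(extract_camel_words("FF0MASTSDRatingUnitState20211103105132-12991140"))  # RatingUnitState
print(extract_camel_words("AA000MASTERNewWord10111103102344-22991111"))        # NewWord
print(extract_camel_words("12345"))                                            # None
```

One limitation to keep in mind: a one-letter word at the start of the camel case part (for example the "A" in "MASTERAState") cannot be told apart from the capital prefix, so the function would return only "State".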

CodePudding user response:

Your logic should be something like this.

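text = "FF0MASTSDRatingUnitState20211103105132-12991140"
small = "abcdefghijklmnopqrstuvwxyz"
num = "0123456789"
index0 = -1  # start of the word: the capital just before the first lowercase letter
index1 = -1  # position of the first digit after the word

foundSmall = False

# enumerate gives the real position of each character; text.index(ch) would
# return the first occurrence of ch, which may be an earlier one (e.g. the "0" in "FF0")
for pos, ch in enumerate(text):
    if ch in small and not foundSmall:
        index0 = pos - 1
        foundSmall = True

    if ch in num and foundSmall:
        index1 = pos
        break

print(text[index0:index1])  # RatingUnitState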
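
The same loop idea can also be wrapped in a function so it works on any input string. This is only a sketch under the same ASCII assumption; extract_between is a hypothetical name, and returning None when there is no lowercase letter is my own choice, not something the question specifies.

```python
import string

def extract_between(text):
    """Slice from the capital before the first lowercase letter up to the next digit."""
    start = None
    for pos, ch in enumerate(text):
        if start is None:
            if ch in string.ascii_lowercase:
                # Step back one character to include the leading capital;
                # max() guards against a string that starts with a lowercase letter.
                start = max(pos - 1, 0)
        elif ch in string.digits:
            return text[start:pos]
    # No digit after the word: return the rest, or None if no lowercase was found.
    return None if start is None else text[start:]

print(extract_between("FF0MASTSDRatingUnitState20211103105132-12991140"))  # RatingUnitState
print(extract_between("AA000MASTERNewWord00111103102344-22991111"))        # NewWord
```

The second call uses a variation of the question's string in which the word is followed by "0", a digit that also appears earlier in the prefix, to show that tracking positions with enumerate handles that case.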