Regex use to split a string with an OPTIONAL SEPARATOR (char)

I have a string that looks like this: new_tt_j1213.

I need to split it into 'new', 'tt', 'j1213'.

String rules:

The first three characters are always letters.
After the first '_' (if it is present) there are exactly 2 letters.
After the second '_' (if it is present) comes a code in EITHER the t1524 OR the t014 format (the first symbol is always a letter, followed by 3 or 4 digits).
All letters are lower case.
Either '_' can be missing due to corruption.

This is very easy with the split() method, BUT because the data is sometimes corrupted, the first underscore can be missing: 'newtt_j1213'

Since I am very new to regex, can you help me adjust my code below so the split works as described even without the first '_' (or even without the second '_')?

import re

p1 = ''
p2 = ''
p3 = ''

str_t = 'newtt_j123'
str_t1 = 'new_tt_j1213'
str_t2 = 'newttj1213'
str3 = 'new_ttj1213'

# Test for data corruption: fewer than 3 parts means an underscore is missing
tr_lst = re.split('_', str_t)
if len(tr_lst) < 3:
    print('DATA CORRUPTION - ', tr_lst)

# Fails with ValueError when an underscore is missing
p1, p2, p3 = re.split('_', str_t)  # THIS LINE NEEDS ADJUSTMENT (REGEX?)
print(p1, p2, p3, str_t)

Thank you!!

CodePudding user response：

Assuming your strings are of the form "ABB_CCCCC" or "A_BB_CCCCC" and you want to extract A/BB/CCCCC, you could use:

import re
re.findall(r'(.)_?([^_]+)_(.+)', your_string)[0]

NB1. If you always have 1 character, 2 characters, 5 characters, use: '(.)_?(..)_(.{5})'

NB2. This assumes that every string matches one of the two forms; otherwise findall returns an empty list and the [0] raises an IndexError.

example:

s1 = 'n_tt_j1213'
re.findall(r'(.)_?([^_]+)_(.+)', s1)[0]

s2 = 'ntt_j1213'
re.findall(r'(.)_?([^_]+)_(.+)', s2)[0]

output for both:

('n', 'tt', 'j1213')
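
The IndexError mentioned in NB2 can be avoided by checking the match object instead of indexing into the findall result. Here is a minimal sketch, assuming the pattern is written with `+` quantifiers; the helper name split_or_none is hypothetical and not part of the original answer:

```python
import re

# Same pattern as in this answer, written with explicit `+` quantifiers.
PATTERN = re.compile(r'(.)_?([^_]+)_(.+)')

def split_or_none(text):
    """Hypothetical helper: return the three captured parts, or None if there is no match."""
    m = PATTERN.fullmatch(text)
    return m.groups() if m else None

print(split_or_none('n_tt_j1213'))  # ('n', 'tt', 'j1213')
print(split_or_none('ntt_j1213'))   # ('n', 'tt', 'j1213')
print(split_or_none('nttj1213'))    # None, because the pattern requires the second '_'
```

Note that this pattern still needs the second underscore, so a string that has lost both separators is reported as None rather than split.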

CodePudding user response：

From what I understand, you can do it without regex.

First remove every _ that is present with s.replace('_', ''). Next, use slicing to extract the parts like this:

s = "newtt_j1213"
s = s.replace('_', '')
ss = [ s[0:3], s[3:5], s[5:] ]
print(ss)

Output: ['new', 'tt', 'j1213']
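
The slicing approach splits any input, including strings that break the rules from the question. If that matters, the parts can be checked after slicing. Below is a rough sketch, assuming the rules from the question (3 letters, 2 letters, then one letter plus 3 or 4 digits); the function name split_by_position is made up for illustration:

```python
import string

LOWER = set(string.ascii_lowercase)
DIGITS = set(string.digits)

def split_by_position(text):
    """Hypothetical helper: strip underscores, slice at fixed positions, then validate."""
    s = text.replace('_', '')
    head, mid, tail = s[:3], s[3:5], s[5:]
    valid = (
        len(head) == 3 and set(head) <= LOWER      # three lowercase letters
        and len(mid) == 2 and set(mid) <= LOWER    # two lowercase letters
        and len(tail) in (4, 5)                    # one letter + 3 or 4 digits
        and tail[0] in LOWER
        and set(tail[1:]) <= DIGITS
    )
    if not valid:
        raise ValueError(f'unexpected format: {text!r}')
    return [head, mid, tail]

print(split_by_position('newtt_j1213'))  # ['new', 'tt', 'j1213']
print(split_by_position('new_tt_j123'))  # ['new', 'tt', 'j123']
```

Because every underscore is removed first, a misplaced underscore (for example 'ne_wtt_j1213') is silently accepted; this sketch only validates the characters, not the separator positions.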

CodePudding user response：

This is how I would approach this:

import re

regex = re.compile(r'^([a-z]{3}).*([a-z]{2}).*([a-z]\d{3,4})$')
matches = regex.search('newtt_j123')
print(matches.groups())

OUT: ('new', 'tt', 'j123')

The regex works like this: ^([a-z]{3}) anchors the match to the start of the string, then matches 3 lowercase letters and stores them in the first capture group. [a-z] is a character class that means any lowercase letter, and the 3 in curly brackets means exactly three of them. It is a capture group because it is wrapped in normal brackets, i.e. ( ).

Then .* greedily matches anything up to the next capture group, ([a-z]{2}), which matches any two lowercase letters.

Then .* again greedily matches anything up to the last capture group, ([a-z]\d{3,4}). This matches one lowercase letter followed by three or four digits, where \d denotes a digit and {3,4} covers both the t014 and the t1524 format. The trailing $ anchors the match to the end of the string.

Because .* matches any characters, not only '_', this pattern also accepts other characters between the parts; replace each .* with _? if only an optional underscore should be allowed.

When you print matches.groups() you get the captures in that order, i.e. first the three letters, then the next two letters, and last the letter followed by the digits.
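
To tie this back to the question's test strings and its corruption check, here is a sketch of a stricter variant. It assumes that only the underscores can go missing and that no other characters appear between the parts; the names CODE_RE and parse_code are hypothetical:

```python
import re

# Each optional underscore gets its own group so a missing one can be detected.
CODE_RE = re.compile(r'([a-z]{3})(_?)([a-z]{2})(_?)([a-z]\d{3,4})')

def parse_code(text):
    """Hypothetical helper: return ((p1, p2, p3), corrupted) or raise ValueError."""
    m = CODE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f'unexpected format: {text!r}')
    p1, sep1, p2, sep2, p3 = m.groups()
    corrupted = not (sep1 and sep2)  # True when at least one '_' is missing
    return (p1, p2, p3), corrupted

for s in ('newtt_j123', 'new_tt_j1213', 'newttj1213', 'new_ttj1213'):
    (p1, p2, p3), corrupted = parse_code(s)
    if corrupted:
        print('DATA CORRUPTION - ', s)
    print(p1, p2, p3, s)
```

Using fullmatch makes the ^ and $ anchors unnecessary, and checking for None avoids the AttributeError that calling .groups() on a failed match would raise.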