Python, regex to split alphanumeric string with multiple separators

Time:06-24

Good morning,
I have a series of file names in the form 52798687KF_12712320CP.txt, from which I extract four substrings, namely 52798687, KF, 12712320, and CP.

At present, I get those elements through a sequence of rough split operations:

s = '52798687KF_12712320CP.txt'

f1 = s.split('_')[0][:-2]    # '52798687'
f2 = s.split('_')[0][-2:]    # 'KF'
f3 = s.split('_')[1][:-6]    # '12712320'
f4 = s.split('_')[1][-6:-4]  # 'CP'

I would like to achieve the same result with a single statement, resorting to a regular expression, since, as explained below, the name structure may vary with certain criteria.
However, I got stuck because I can't work out the right syntax; after several attempts I came up with this partial solution:

import re

s = '52798687KF_12712320CP.txt'
reg = r"(?<=\d)(?=\D)|(_)|(.[a-z]{3})|(?=\d).(?<=\D)"
x = re.split(reg, s)

But it results in a list with too many elements:

['52798687', None, None, 'KF', '_', None, '12712320', None, None, 'CP', None, '.txt', '']

Whereas I want a list containing:

['52798687', 'KF', '12712320', 'CP']

Some details about each element:

  1. at least one digit;
  2. two letters, between the last digit and the underscore;
  3. at least one alphanumeric character;
  4. two letters ahead of the extension period.

Thank you ever so much!

CodePudding user response:

As your third group can consist of a mix of alphanumeric characters, I'd do the following based on your four-point list:

import re
reg = re.compile(r"(?i)^(\d+)([a-z]{2})_([a-z\d]+)([a-z]{2})\.")

s = "1AA_A1AAA.txt"  # sample input
m = reg.match(s)
if m:
    print(m.groups())  # ('1', 'AA', 'A1A', 'AA')
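
As a rough sketch, the same idea can be wrapped in a small helper and applied to the file name from the question. The names `NAME_RE` and `split_name` are made up for illustration, and the pattern assumes the four criteria listed in the question hold:

```python
import re

# Hypothetical helper: digits, two letters, underscore,
# alphanumerics, two letters, then the extension period.
NAME_RE = re.compile(r"^(\d+)([a-z]{2})_([a-z\d]+)([a-z]{2})\.", re.IGNORECASE)

def split_name(filename):
    """Return the four parts as a list, or None if the name does not fit."""
    m = NAME_RE.match(filename)
    return list(m.groups()) if m else None

print(split_name("52798687KF_12712320CP.txt"))  # ['52798687', 'KF', '12712320', 'CP']
print(split_name("no_match.txt"))               # None
```

Returning None for names that do not fit makes it easy to skip unexpected files instead of getting a wrongly sliced result.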

CodePudding user response:

You can try the following regular expression solution:

import re

s = '52798687KF_12712320CP.txt'
print(re.findall(r"[^\W\d_]+|\d+", s.split(".")[0]))

Output:

['52798687', 'KF', '12712320', 'CP']
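
One caveat, sketched here on the assumption that the third element really can mix letters and digits (point 3 of the question): this approach cuts at every boundary between letters and digits, so such names yield more than four items. The function name and the second sample name are made up for illustration:

```python
import re

def letter_digit_runs(filename):
    # Runs of letters or runs of digits, taken from the part before the first period
    return re.findall(r"[^\W\d_]+|\d+", filename.split(".")[0])

print(letter_digit_runs("52798687KF_12712320CP.txt"))  # ['52798687', 'KF', '12712320', 'CP']
print(letter_digit_runs("1AA_A1AAA.txt"))              # ['1', 'AA', 'A', '1', 'AAA']
```

So this works as long as the third element is digits only, as in the sample file name.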