Good morning,
I have a series of file names in the form 52798687KF_12712320CP.txt, from which I extract four substrings, namely 52798687, KF, 12712320, and CP.
At present, I get those elements through a sequence of rough split operations:
s = '52798687KF_12712320CP.txt'
f1 = s.split('_')[0][:-2]    # '52798687'
f2 = s.split('_')[0][-2:]    # 'KF'
f3 = s.split('_')[1][:-6]    # '12712320'
f4 = s.split('_')[1][-6:-4]  # 'CP'
I would like to achieve the same result with a single statement, resorting to a regular expression, since, as explained below, the name structure may vary with certain criteria.
However, I got stuck, since I'm not able to compose the right syntax; after several attempts I came up with this partial solution:
import re
s = '52798687KF_12712320CP.txt'
reg = r"(?<=\d)(?=\D)|(_)|(.[a-z]{3})|(?=\d).(?<=\D)"
x = re.split(reg, s)
But it results in a list with too many elements:
['52798687', None, None, 'KF', '_', None, '12712320', None, None, 'CP', None, '.txt', '']
Whereas I want a list containing:
['52798687', 'KF', '12712320', 'CP']
Some details about each element:
- at least one digit;
- two letters, between the last digit and the underscore;
- at least one alphanumeric character;
- two letters ahead of the extension period.
Thank you ever so much!
CodePudding user response:
Since your third group can mix letters and digits, I'd do the following, based on your four-point list:
import re
reg = re.compile(r"(?i)^(\d+)([a-z]{2})_([a-z\d]+)([a-z]{2})\.")
s = "1AA_A1AAA.txt" # sample input
m = reg.match(s)
if m:
    print(m.groups()) # ('1', 'AA', 'A1A', 'AA')
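To get a list like the one asked for, the same idea can be wrapped in a small helper. This is only a sketch: the names PATTERN and split_name are made up for illustration, and the flag is passed as re.IGNORECASE instead of the inline (?i).

```python
import re

# Four groups: digits, two letters, alphanumerics, two letters before the period.
# PATTERN and split_name are hypothetical names chosen for this sketch.
PATTERN = re.compile(r"^(\d+)([a-z]{2})_([a-z\d]+)([a-z]{2})\.", re.IGNORECASE)

def split_name(filename):
    """Return the four parts as a list, or None if the name does not match."""
    m = PATTERN.match(filename)
    return list(m.groups()) if m else None

print(split_name("52798687KF_12712320CP.txt"))  # ['52798687', 'KF', '12712320', 'CP']
print(split_name("no_match.txt"))               # None
```

Returning None for names that don't fit the structure makes it easy to skip or report unexpected files.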
CodePudding user response:
You can try the following regular expression solution:
import re
s = '52798687KF_12712320CP.txt'
print(re.findall(r"[^\W\d_]+|\d+", s.split(".")[0]))
Output:
['52798687', 'KF', '12712320', 'CP']
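One caveat, shown in the sketch below: this pattern splits on every change between letters and digits, so it only yields four elements when the third part is purely numeric, as in the sample name. The helper name tokens is made up for illustration.

```python
import re

def tokens(filename):
    """Split the part before the first period into runs of letters or digits."""
    stem = filename.split(".")[0]
    return re.findall(r"[^\W\d_]+|\d+", stem)

# Third part is digits only: four elements, as wanted.
print(tokens("52798687KF_12712320CP.txt"))  # ['52798687', 'KF', '12712320', 'CP']

# Third part mixes letters and digits: it gets broken into more pieces.
print(tokens("1AA_A1AAA.txt"))  # ['1', 'AA', 'A', '1', 'AAA']
```

If the third part can really contain letters, an anchored pattern with four explicit groups is the safer choice.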