I need to extract alpha numeric code from various fields in data. The code could be in any of three fields. I need the first match -- then extract the code with a substitution/capture group to populate the "code" column.
In the toy example below, the output is what I would expect ( AKA: it "works" ). In my production version, it skips obvious matches ( not all, small number ) and on another small number, the substitution results in smushed results or a full return of the matched field.
Toy data:
df3 = pd.DataFrame([[1000, 'File X234 version 2.pdf', 'My Title 8209','BR1',''],
[1001, 'File_X003.pdf', 'Title X003', 'BR1',''],
[1003, 'File.pdf', 'BR3 8200', 'BR2',''],
[1004, 'BR5_file.doc','BR4 F200','BR1',''],
[1005, 'file.txt', 'Title', 'BR1', ''],
[1006, '8208 doc3.txt', 'doc3', 'BR4', '']],
columns=['ID', 'File Name', 'Title', 'Type','Code'])
REGEX paterns with capture groups:
patternA = re.compile(r'^.*(82[0-9]{2}).*$', re.IGNORECASE)
patternB = re.compile(r'^.*(X[0-9]{3}).*$', re.IGNORECASE)
patternC = re.compile(r'^.*(F[0-9]{3}).*$', re.IGNORECASE)
Function very similar to production version:
def func_bar(x):
text = x['File Name'] x['Title']
if patternA.search(text):
value = patternA.sub(r"\1", text)
return value
elif patternB.search(text):
value = patternB.sub(r'\1', text)
return value
elif patternC.search(text):
value = patternC.sub(r'\1', text)
return value
else: return "Not Found"
Apply function to column:
df3['Code'] = df3.apply(func_bar, axis=1)
Success(?):
ID File Name Title Type Code
0 1000 File X234 version 2.pdf My Title 8209 BR1 8209
1 1001 File_X003.pdf Title X003 BR1 X003
2 1003 File.pdf BR3 8200 BR2 8200
3 1004 BR5_file.doc BR4 F200 BR1 F200
4 1005 file.txt Title BR1 Not Found
5 1006 8208 doc3.txt doc3 BR4 8208
Any ideas why my production version is:
- skipping just a few obvious matches?
- Smushing output ( only a few )
- Returning the whole matched field?
CodePudding user response:
You can concatenate the necessary columns into a code
column, and then extract with a single regex:
df3['code'] = df3['File Name'] df3['Title'] # add more columns if needed
df3['code'] = df3['code'].str.extract(r'((?:82|[XF]\d)\d{2})').fillna("Not found")
Output:
")
>>> df3
ID File Name Title Type Code code
0 1000 File X234 version 2.pdf My Title 8209 BR1 X234
1 1001 File_X003.pdf Title X003 BR1 X003
2 1003 File.pdf BR3 8200 BR2 8200
3 1004 BR5_file.doc BR4 F200 BR1 F200
4 1005 file.txt Title BR1 Not found
5 1006 8208 doc3.txt doc3 BR4 8208
The ((?:82|[XF]\d)\d{2})
regex matches
(?:82|[XF]\d)
-82
, orX
orF
and a digit\d{2}
- two digits.
The Series.str.extract
returns the first match found.
If you want to use several regexps just OR them with the |
alternation operator:
regexps = [r'82[0-9]{2}', r'X[0-9]{3}', r'F[0-9]{3}'] # etc.
df3['code'] = df3['File Name'] df3['Title'] # add more columns if needed
df3['code'].str.extract(f"({'|'.join(regexps)})").fillna("Not found")