Pandas: regex search function skipping matches and returning incorrect results-CodePudding

I need to extract alpha numeric code from various fields in data. The code could be in any of three fields. I need the first match -- then extract the code with a substitution/capture group to populate the "code" column.

In the toy example below, the output is what I would expect ( AKA: it "works" ). In my production version, it skips obvious matches ( not all, small number ) and on another small number, the substitution results in smushed results or a full return of the matched field.

Toy data:

df3 = pd.DataFrame([[1000, 'File X234 version 2.pdf', 'My Title 8209','BR1',''], 
                    [1001, 'File_X003.pdf', 'Title X003', 'BR1',''], 
                    [1003, 'File.pdf', 'BR3 8200', 'BR2',''],
                    [1004, 'BR5_file.doc','BR4 F200','BR1',''],
                    [1005, 'file.txt', 'Title', 'BR1', ''],
                    [1006, '8208 doc3.txt', 'doc3', 'BR4', '']],
                   columns=['ID', 'File Name', 'Title', 'Type','Code'])

REGEX paterns with capture groups:

patternA   = re.compile(r'^.*(82[0-9]{2}).*$', re.IGNORECASE)
patternB   = re.compile(r'^.*(X[0-9]{3}).*$', re.IGNORECASE)
patternC   = re.compile(r'^.*(F[0-9]{3}).*$', re.IGNORECASE)

Function very similar to production version:

def func_bar(x):
    text = x['File Name']   x['Title']

    if patternA.search(text): 
        value = patternA.sub(r"\1", text)
        return value
    elif patternB.search(text): 
        value = patternB.sub(r'\1', text)
        return value
    elif patternC.search(text): 
        value = patternC.sub(r'\1', text)
        return value
    else: return "Not Found"

Apply function to column:

df3['Code'] = df3.apply(func_bar, axis=1)

Success(?):

    ID          File Name                   Title           Type    Code
0   1000        File X234 version 2.pdf     My Title 8209   BR1     8209
1   1001        File_X003.pdf               Title X003      BR1     X003
2   1003        File.pdf                    BR3 8200        BR2     8200
3   1004        BR5_file.doc                BR4 F200        BR1     F200
4   1005        file.txt                    Title           BR1     Not Found
5   1006        8208 doc3.txt               doc3            BR4     8208

Any ideas why my production version is:

skipping just a few obvious matches?
Smushing output ( only a few )
Returning the whole matched field?

CodePudding user response：

You can concatenate the necessary columns into a code column, and then extract with a single regex:

df3['code'] = df3['File Name']   df3['Title'] # add more columns if needed
df3['code'] = df3['code'].str.extract(r'((?:82|[XF]\d)\d{2})').fillna("Not found")

Output:

")
>>> df3
     ID                File Name          Title Type Code       code
0  1000  File X234 version 2.pdf  My Title 8209  BR1            X234
1  1001            File_X003.pdf     Title X003  BR1            X003
2  1003                 File.pdf       BR3 8200  BR2            8200
3  1004             BR5_file.doc       BR4 F200  BR1            F200
4  1005                 file.txt          Title  BR1       Not found
5  1006            8208 doc3.txt           doc3  BR4            8208

The ((?:82|[XF]\d)\d{2}) regex matches

(?:82|[XF]\d) - 82, or X or F and a digit
\d{2} - two digits.

The Series.str.extract returns the first match found.

If you want to use several regexps just OR them with the | alternation operator:

regexps = [r'82[0-9]{2}', r'X[0-9]{3}', r'F[0-9]{3}'] # etc.
df3['code'] = df3['File Name']   df3['Title'] # add more columns if needed
df3['code'].str.extract(f"({'|'.join(regexps)})").fillna("Not found")