Extract the integer in a filename from a complete path using split regex in Pandas

Given a DataFrame:

df=pd.DataFrame(['/home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__131147.png',
                 '/home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__160565.png'])

I would like to extract only the integer just before the file extension.

The code below achieves this:

import os

df['fname'] = df[0].apply(lambda x: os.path.split(x)[1])
df['f'] = df['fname'].apply(lambda x: x.split('__')[1].split('.png')[0])
df['f'] = df['f'].astype(int)

However, I have the impression this can be achieved more easily using pandas' built-in split, such as below:

df['f']=df[0].str.split(re.compile(r"__\d.jpg"), expand=True)

But it seems nothing is being split. Which parameter is not set correctly?

CodePudding user response：

You can use Series.str.extract:

df['num'] = df['f'].str.extract(r'_(\d+)\.[^.]+$', expand=False)


Details:

_ - an underscore
(\d+) - Capturing group 1 (this is the value returned by Series.str.extract): one or more digits
\. - a . char
[^.]+ - one or more chars other than a . char
$ - end of string
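
The breakdown above can be checked on a single path with Python's standard re module. This is only a minimal sketch; the variable names path, match and digits are illustrative and not part of the original answer.

```python
import re

path = '/home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__131147.png'

# Underscore, captured digits, a dot, the extension, then end of string
match = re.search(r'_(\d+)\.[^.]+$', path)

# Group 1 holds only the digits; the underscore and the extension
# are matched but not captured
digits = match.group(1)
print(digits)
```

Series.str.extract applies the same pattern to every row and returns the content of group 1.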

Python test:

import pandas as pd
df = pd.DataFrame({'f':['/home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__131147.png',
    '/home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__160565.png']})
df['num'] = df['f'].str.extract(r'_(\d+)\.[^.]+$', expand=False)
print(df.to_string())

Output:

                                                                         f     num
0  /home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__131147.png  131147
1  /home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__160565.png  160565

CodePudding user response：

Assuming 0 is the name of your column (as in your example), you can use str.extract:

df[0].str.extract(r'(\d+)\.[^.]+$', expand=False)

output:

0    131147
1    160565
Name: 0, dtype: object

To assign to a new column:

df['f'] = df[0].str.extract(r'(\d+)\.[^.]+$')
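
Note that str.extract returns strings (the dtype: object in the output above), while the question asks for integers. A minimal sketch of the conversion follows, assuming every row matches the pattern; a row without a match would produce NaN, and astype(int) would then raise an error.

```python
import pandas as pd

df = pd.DataFrame(['/home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__131147.png',
                   '/home/dtest/Documents/user/exp/S1/test1/test3/sub5/file_2_F__160565.png'])

# expand=False returns a Series; astype(int) turns the digit strings into integers
df['f'] = df[0].str.extract(r'(\d+)\.[^.]+$', expand=False).astype(int)
print(df['f'].tolist())
```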


CodePudding user response：

def extract(values):
    values = values.split('__')  # cut at '__'
    return int(values[-1].replace('.png', ''))  # take the last part and strip the '.png' extension

df[0].apply(extract)