Get digits before file extension using regex

Time:08-04

I want to extract two numbers from each filename in a list using regex. The two numbers are always located just before the file extension and are separated by a dash. In the example filename below, I want 10 and 11. To get the number just before the extension (11 in the example), I am using \d+(?=.raw), and it seems to work. However, I am struggling to write something similar for the preceding number (10 in this example).

D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatten-sliced\20210730-HK-S-006-PLATE-flatten-10-11.raw\

The expression will be used to create a new column in a pandas dataframe as follows: df['y'] = df['Filename'].apply(lambda x: re.findall('\d+(?=.raw)', x)[0])

CodePudding user response:

You could try as follows:

import pandas as pd
import re
data = {'Filename': ['something-1-2.csv','something-10-11.raw']}
df = pd.DataFrame(data)

pattern = r'(\d+)-(\d+(?=\..+$))'

df['y'] = df['Filename'].apply(lambda x: re.findall(pattern,x)[0])
print(df)

              Filename         y
0    something-1-2.csv    (1, 2)
1  something-10-11.raw  (10, 11)

# or if you want to split them in different cols immediately, try:
df[['y1','y2']] = df['Filename'].apply(lambda x: re.findall(pattern,x)[0]).tolist()
print(df)

              Filename  y1  y2
0    something-1-2.csv   1   2
1  something-10-11.raw  10  11
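
One caveat with indexing the result of re.findall() using [0]: if a filename does not match, the list is empty and the lookup raises an IndexError. The following is a minimal sketch of a guard for that case, assuming the pattern is meant to read (\d+)-(\d+(?=\..+$)). The helper name last_two_numbers and the sample filename notes.txt are made up for illustration.

```python
import re

# Pattern from the answer above, with the '+' quantifiers written out.
PATTERN = re.compile(r'(\d+)-(\d+(?=\..+$))')

def last_two_numbers(filename):
    """Return the two numbers before the extension as a tuple of strings.

    Hypothetical helper: returns (None, None) instead of raising
    when the filename does not contain the expected "N-M.ext" tail.
    """
    found = PATTERN.findall(filename)
    return found[0] if found else (None, None)

for name in ['something-1-2.csv', 'something-10-11.raw', 'notes.txt']:
    print(name, last_two_numbers(name))
```

It could be dropped into the apply() call in place of the lambda, for example df['Filename'].apply(last_two_numbers).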

CodePudding user response:

Try to utilize str.findall() instead:

import pandas as pd

df = pd.DataFrame({'Filename': [r'D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatten-sliced\20210730-HK-S-006-PLATE-flatten-10-11.raw\'']})
df['y'] = df['Filename'].str.findall(r'\d+(?=(?:-\d+)?\.[^.]+$)')

print(df)

Prints:

                                          Filename         y
0  D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatte...  [10, 11]

Pattern used:

\d+(?=(?:-\d+)?\.[^.]+$)


  • \d+ - One or more digits;
  • (?= - Open positive lookahead;
    • (?:-\d+)? - Optional non-capture group matching a hyphen and one or more digits;
    • \.[^.]+$ - Literal dot followed by one or more non-dot characters and the end-of-line anchor.
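
To see why the optional group inside the lookahead matters, here is a small sketch using only the standard re module, assuming the pattern reads \d+(?=(?:-\d+)?\.[^.]+$). The function name trailing_numbers is a made-up helper for illustration.

```python
import re

# Lookahead pattern from above, with the '+' quantifiers written out.
PATTERN = re.compile(r'\d+(?=(?:-\d+)?\.[^.]+$)')

def trailing_numbers(filename):
    """Return every digit run that sits directly before the extension,
    either immediately or with one more '-digits' group in between."""
    return PATTERN.findall(filename)

samples = [
    'something-1-2.csv',
    '20210730-HK-S-006-PLATE-flatten-10-11.raw',
]
for name in samples:
    # Earlier digit runs such as 20210730 or 006 are skipped, because
    # what follows them is neither '.ext' nor '-digits.ext'.
    print(name, trailing_numbers(name))
```

The first number matches because the lookahead can consume the hyphen and second number before reaching the dot; the second number matches because that group is optional.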

EDIT:

Or, if you must put these in separate columns, we can use str.extract():

import pandas as pd

df = pd.DataFrame({'Filename': [r'D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatten-sliced\20210730-HK-S-006-PLATE-flatten-10-11.raw\'']})
df[['y1','y2']] = df['Filename'].str.extract(r'(\d+)-(\d+)\.[^.]+$')

print(df)

Prints:

                                            Filename  y1  y2
0  D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatte...  10  11

Where the main difference in the pattern is that we got rid of the lookahead and instead used two capture groups to grab the numbers.
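
Note that the captured values are strings, not integers. As a rough sketch of one way to get integers, the two groups can be named and converted after matching. This assumes the pattern reads (\d+)-(\d+)\.[^.]+$; the group names y1 and y2 mirror the column names above, and parse_pair is a hypothetical helper.

```python
import re

# Two-capture-group pattern from above, with the groups given names.
NAMED = re.compile(r'(?P<y1>\d+)-(?P<y2>\d+)\.[^.]+$')

def parse_pair(filename):
    """Return {'y1': int, 'y2': int}, or None if the filename does not
    end in the expected 'N-M.ext' form. Hypothetical helper."""
    match = NAMED.search(filename)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}

print(parse_pair('something-10-11.raw'))
print(parse_pair('notes.txt'))
```

With pandas, str.extract() uses the names of named groups as column names, so the same named pattern could also be passed there directly.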
