I want to extract two pieces of information from a list of filenames using regex.
The two numbers are always located just before the file extension and separated by a dash.
In the example filename below, I would aim for 10 and 11.
To get the number just before the extension (so 11 in the example), I am using \d+(?=.raw) and it seems to work.
However, I struggle to get something similar for the number before (so 10 in this example).
D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatten-sliced\20210730-HK-S-006-PLATE-flatten-10-11.raw\
The expression is to be used to create a new column in a pandas dataframe in the following way: df['y'] = df['Filename'].apply(lambda x: re.findall('\d+(?=.raw)', x)[0])
CodePudding user response:
You could try as follows:
import pandas as pd
import re
data = {'Filename': ['something-1-2.csv','something-10-11.raw']}
df = pd.DataFrame(data)
pattern = r'(\d+)-(\d+(?=\..+$))'
df['y'] = df['Filename'].apply(lambda x: re.findall(pattern,x)[0])
print(df)
              Filename         y
0    something-1-2.csv    (1, 2)
1  something-10-11.raw  (10, 11)
# or if you want to split them in different cols immediately, try:
df[['y1','y2']] = df['Filename'].apply(lambda x: re.findall(pattern,x)[0]).tolist()
print(df)
              Filename  y1  y2
0    something-1-2.csv   1   2
1  something-10-11.raw  10  11
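One caveat: re.findall(pattern, x)[0] raises an IndexError for any filename that does not match. Below is a minimal sketch of a more forgiving variant, assuming you would rather get (None, None) for such rows. The helper name extract_pair is hypothetical and only used for illustration.

```python
import re

# Two dash-separated numbers, the second one followed by a file extension.
PATTERN = re.compile(r'(\d+)-(\d+(?=\..+$))')

def extract_pair(filename):
    """Return the two numbers as strings, or (None, None) if nothing matches."""
    match = PATTERN.search(filename)
    if match is None:
        return (None, None)
    return match.groups()

print(extract_pair('something-10-11.raw'))  # ('10', '11')
print(extract_pair('no-numbers-here.raw'))  # (None, None)
```

With a DataFrame like the one above, this could be used as df['y'] = df['Filename'].apply(extract_pair).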
CodePudding user response:
Try using str.findall() instead:
import pandas as pd
df = pd.DataFrame({'Filename': [r'D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatten-sliced\20210730-HK-S-006-PLATE-flatten-10-11.raw\'']})
df['y'] = df['Filename'].str.findall(r'\d+(?=(?:-\d+)?\.[^.]+$)')
print(df)
Prints:
                                            Filename         y
0  D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatte...  [10, 11]
Pattern used:
\d+(?=(?:-\d+)?\.[^.]+$)
\d+ - 1+ digits;
(?= - Open positive lookahead;
(?:-\d+)? - Optional non-capture group to match a hyphen and 1+ digits;
\.[^.]+$ - Literal dot followed by 1+ non-dots and the end-line anchor;
) - Close the lookahead.
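To see the lookahead at work outside pandas, here is a small sketch using plain re. The filenames are shortened or made up for brevity (an assumption, not the full path from the question).

```python
import re

# Digits that are followed by an optional "-digits" part and then the extension.
pattern = re.compile(r'\d+(?=(?:-\d+)?\.[^.]+$)')

name = '20210730-HK-S-006-PLATE-flatten-10-11.raw'
matches = pattern.findall(name)
print(matches)  # ['10', '11']

# Earlier digit runs such as the date are skipped, because nothing
# but an optional "-digits" may sit between a match and the extension.
single = pattern.findall('scan-7.raw')
print(single)  # ['7']
```

Note that the lookahead only checks what follows each number without consuming it, which is why both 10 and 11 can be returned as separate matches.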
EDIT:
Or, if you must put these in separate columns, we can use str.extract():
import pandas as pd
df = pd.DataFrame({'Filename': [r'D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatten-sliced\20210730-HK-S-006-PLATE-flatten-10-11.raw\'']})
df[['y1','y2']] = df['Filename'].str.extract(r'(\d+)-(\d+)\.[^.]+$')
print(df)
Prints:
                                            Filename  y1  y2
0  D:\CDTFlatten02\20210730-HK-S-006-PLATE-flatte...  10  11
Where the main difference in the pattern is that we got rid of the lookahead and instead used two capture groups to grab the numbers.
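The captured values come back as strings. If actual numbers are needed, one possible follow-up is sketched below, assuming every filename matches the pattern (astype(int) would fail on the NaN produced by a non-matching row). The sample filenames are illustrative, and the named groups are an optional variation that lets str.extract() name the columns directly.

```python
import pandas as pd

df = pd.DataFrame({'Filename': ['something-1-2.csv', 'something-10-11.raw']})

# Named groups become the column names of the frame returned by str.extract().
parts = df['Filename'].str.extract(r'(?P<y1>\d+)-(?P<y2>\d+)\.[^.]+$')

# The extracted values are strings; cast them to integers.
# Assumption: every row matched, otherwise NaN makes astype(int) raise.
df = df.join(parts.astype(int))
print(df)
```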