I have a set of file names from which I need to extract the dates. The file names look like:
['1 120836_1_20210101.csv',
'1 120836_1_20210108.csv',
'1 120836_20210101.csv',
'1 120836_20210108.csv',
'10 120836_1_20210312.csv',
'10 120836_20210312.csv',
'11 120836_1_20210319.csv',
'11 120836_20210319.csv',
'12 120836_1_20210326.csv',
...
]
As an example, I would need to extract 20210101 from the first item in the list above.
Here is my code but it is not working - I'm not totally familiar with regex.
import re
dates = []
for file in files:
    dates.extend(re.findall("(?<=_)\d{}(?=\d*\.)", file))
CodePudding user response:
You weren't that far off, but there were a few issues:
- you extend dates by the result of the .findall, but you only expect to find one match per file name and are constructing all of dates in one go, so that would be a lot simpler with a re.search in a list comprehension
- your regex has a few unneeded complications (and some bugs)
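To make the first point concrete, here is a small sketch of how the two calls differ. The pattern below is deliberately simplistic and only there for the comparison; it is not the pattern you need for the dates.

```python
import re

name = "1 120836_1_20210101.csv"
pattern = r"\d+"  # illustrative only: any run of digits

# findall returns a list with every non-overlapping match
all_matches = re.findall(pattern, name)

# search returns a single Match object for the first match (or None)
first_match = re.search(pattern, name)

print(all_matches)           # ['1', '120836', '1', '20210101']
print(first_match.group(0))  # 1
```

Since each file name should yield exactly one date, a single match per name is easier to work with than a list per name.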
This is what you were after:
import re
files = [
'1 120836_1_20210101.csv',
'1 120836_1_20210108.csv',
'1 120836_20210101.csv',
'1 120836_20210108.csv',
'10 120836_1_20210312.csv',
'10 120836_20210312.csv',
'11 120836_1_20210319.csv',
'11 120836_20210319.csv',
'12 120836_1_20210326.csv'
]
dates = [re.search(r"(?<=_)\d+(?=\.)", fn).group(0) for fn in files]
print(dates)
Output:
['20210101', '20210108', '20210101', '20210108', '20210312', '20210312', '20210319', '20210319', '20210326']
It keeps the lookbehind for an underscore, and changes the lookahead to look for a period. It just matches all digits (at least one, with +) in between the two.
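As a rough walk-through on a single file name, assuming the pattern is (?<=_)\d+(?=\.): the 1 after the first underscore is rejected because it is followed by another underscore rather than a period, so the first successful match is the date. The second pattern, with a capture group, is just an alternative sketch that avoids lookarounds; it is not something you need.

```python
import re

name = "1 120836_1_20210101.csv"

# Lookbehind and lookahead only check the neighbours; they are not
# part of the matched text, so no underscore or period is returned.
m = re.search(r"(?<=_)\d+(?=\.)", name)
print(m.group(0))  # 20210101
print(m.span())    # (11, 19)

# Alternative sketch: match the underscore and period for real and
# pull out just the digits with a capture group.
m2 = re.search(r"_(\d+)\.", name)
print(m2.group(1))  # 20210101
```

Note that re.search returns None when nothing matches, so calling .group(0) on a file name without a date would raise an AttributeError.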
Note that the r in front of the string avoids having to double up the backslashes in the regex; the backslashes in \d and \. are still required to indicate a digit and a literal period.
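A quick sketch to illustrate both halves of that note, again assuming the pattern (?<=_)\d+(?=\.): the raw string and the doubled-backslash string are the same pattern, and dropping the backslash before the period changes what gets matched.

```python
import re

raw = r"(?<=_)\d+(?=\.)"
escaped = "(?<=_)\\d+(?=\\.)"  # same pattern without the r prefix
print(raw == escaped)  # True

name = "1 120836_1_20210101.csv"

# With an unescaped period, the lookahead accepts any character,
# so the lone 1 between the underscores matches first.
loose = re.search(r"(?<=_)\d+(?=.)", name).group(0)
strict = re.search(raw, name).group(0)
print(loose)   # 1
print(strict)  # 20210101
```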