Regex: Match all characters in between an underscore and a period

I have a set of file names from which I need to extract dates. The file names look like:

['1 120836_1_20210101.csv',
 '1 120836_1_20210108.csv',
 '1 120836_20210101.csv',
 '1 120836_20210108.csv',
 '10 120836_1_20210312.csv',
 '10 120836_20210312.csv',
 '11 120836_1_20210319.csv',
 '11 120836_20210319.csv',
 '12 120836_1_20210326.csv',
 ...
]

As an example, I would need to extract 20210101 from the first item in the list above.

Here is my code, but it is not working. I'm not very familiar with regex.

import re
dates = []
for file in files:
    dates.extend(re.findall("(?<=_)\d{}(?=\d*\.)", file))

Answer:

You weren't that far off, but there were a few issues:

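You extend dates with the result of re.findall, but you expect exactly one match per file name and are building all of dates in that loop, so a re.search in a list comprehension is a lot simpler.
Your regex has a few unneeded complications and some bugs: \d{} has no repeat count, so the {} is matched as literal characters, and the \d* inside the lookahead is unnecessary.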
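
To see why the original attempt finds nothing, here is a quick check. It assumes Python's re module, where an empty {} is not a quantifier; the sample strings are only for illustration.

```python
import re

name = '1 120836_1_20210101.csv'

# Python's re does not treat an empty {} as a quantifier; it matches the
# literal characters "{}", which never occur in these file names.
broken = re.findall(r"(?<=_)\d{}(?=\d*\.)", name)
print(broken)  # no matches

# The same fragment does match when the text really contains "{}".
literal = re.findall(r"\d{}", "7{}")
print(literal)
```

So the pattern compiles without error, which is why the problem shows up as an empty result rather than as an exception.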

This is what you were after:

import re

files = [
    '1 120836_1_20210101.csv',
    '1 120836_1_20210108.csv',
    '1 120836_20210101.csv',
    '1 120836_20210108.csv',
    '10 120836_1_20210312.csv',
    '10 120836_20210312.csv',
    '11 120836_1_20210319.csv',
    '11 120836_20210319.csv',
    '12 120836_1_20210326.csv'
]

dates = [re.search(r"(?<=_)\d+(?=\.)", fn).group(0) for fn in files]

print(dates)

Output:

['20210101', '20210108', '20210101', '20210108', '20210312', '20210312', '20210319', '20210319', '20210326']

It keeps the lookbehind for an underscore and changes the lookahead to look for a period. It then matches all the digits in between the two (at least one, because of the +).
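
One thing to watch for: re.search returns None when a file name has no such date, and calling .group(0) on None raises an AttributeError. A minimal sketch of a more defensive version follows; the helper name extract_date and the sample name 'notes.txt' are made up for illustration.

```python
import re

# Compile once: one or more digits directly after "_" and directly before ".".
DATE_RE = re.compile(r"(?<=_)\d+(?=\.)")

def extract_date(filename):
    """Return the first digit run that follows an underscore and precedes
    a period, or None if the file name contains no such run."""
    match = DATE_RE.search(filename)
    return match.group(0) if match else None

files = ['1 120836_1_20210101.csv', '10 120836_20210312.csv', 'notes.txt']

# Skip names without a date instead of failing on them.
dates = [d for d in map(extract_date, files) if d is not None]
print(dates)
```

Whether to skip such names or raise an error depends on how strict you want to be about the input.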

Note that the r in front of the string avoids having to double up the backslashes in the regex. The backslashes in \d and \. are still required to indicate a digit and a literal period.
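
A small check of both points, using one of the file names without its leading number as sample input:

```python
import re

# A raw string keeps backslashes as written, so these two spell the same pattern.
raw = r"(?<=_)\d+(?=\.)"
escaped = "(?<=_)\\d+(?=\\.)"
same = raw == escaped

sample = "120836_1_20210101.csv"

# Without the backslash, "." matches any character, so the lookahead is
# satisfied by the second underscore and the match stops too early.
loose = re.search(r"(?<=_)\d+(?=.)", sample).group(0)
strict = re.search(r"(?<=_)\d+(?=\.)", sample).group(0)

print(same, loose, strict)
```

The unescaped version picks up the lone 1 between the two underscores, while the escaped version only accepts digits that end right before the period.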