Use regex for extracting month in a column

I have the following dataframe:

                  Data
0           12/25/2020
1           10/25/2020
2  2020-09-12 00:00:00
3  2020-12-09 00:00:00

I'm using the following Python code to extract the first two digits that could represent a month:

df['Data'].apply(lambda x: re.match('.*([1-2][0-9]{3})', x).group(1))

However, it returns a DataFrame of NaN values. When I test it on regex101, it works (link: https://regex101.com/r/QpacQ0/1). So I have two questions:

Is there a better way to work with dates from user input? I mean, I'm building a script that recognizes the date parts by position and then converts them to a datetime object.
And second, why can't this code recognize the months?

CodePudding user response：

You need to use

df['Month'] = df['Data'].str.extract(r'\b(0[1-9]|1[0-2])\b')

With re.match('.*([1-2][0-9]{3})', x), you match zero or more characters other than line break characters, as many as possible, from the start of the string (re.match only looks for a match at the start of the string), and then capture a 1 or 2 followed by any three digits. So you actually match the last occurrence of a specific 4-digit sequence, not a month-like number.
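To see this behaviour in isolation, here is a minimal sketch using only the standard re module on two of the sample strings. The variable names and the two-digit-year input are illustrative assumptions, not part of the original question.

```python
import re

# The pattern from the question: greedy ".*" followed by a captured 4-digit run.
pattern = re.compile(r'.*([1-2][0-9]{3})')

samples = ['12/25/2020', '2020-09-12 00:00:00']
captured = [pattern.match(s).group(1) for s in samples]
print(captured)  # ['2020', '2020'], i.e. the year, not the month

# When the string has no 4-digit run, re.match returns None, so calling
# .group(1) on the result would raise AttributeError.
no_match = pattern.match('12/25/20')  # hypothetical two-digit-year input
print(no_match)  # None
```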

With .str.extract(r'\b(0[1-9]|1[0-2])\b'), you extract the first occurrence of 0 followed by a non-zero digit, or 1 followed by 0, 1 or 2, as a whole word, thanks to the \b word boundaries.

Here is the regex demo.
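The same pattern can be checked without pandas. The sketch below uses re.search, which, like .str.extract, takes the first match in each string. The helper name first_month_like and the day-first sample are assumptions made for illustration.

```python
import re

# Same pattern as in the .str.extract call above.
month_re = re.compile(r'\b(0[1-9]|1[0-2])\b')

def first_month_like(text):
    """Return the first standalone 01-12 token in text, or None."""
    m = month_re.search(text)
    return m.group(1) if m else None

rows = ['12/25/2020', '10/25/2020', '2020-09-12 00:00:00', '2020-12-09 00:00:00']
months = [first_month_like(r) for r in rows]
print(months)  # ['12', '10', '09', '12']

# Caveat: the first 01-12 token wins, so in a day-first string such as
# 05/11/2020 the day is returned instead of the month.
day_first = first_month_like('05/11/2020')
print(day_first)  # 05
```

Because the pattern only looks at the first two-digit token in the 01-12 range, it is reliable only when the month appears before any day value that also falls in that range.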

If Data is not a string column, convert it to one first:

df['Month'] = df['Data'].astype(str).str.extract(r'\b(0[1-9]|1[0-2])\b')
#                       ^^^^^^^^^^^^
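As for the first question (a better way to handle dates from user input), one possible alternative to positional regex matching is to try a short list of known formats and let the parser validate the string. This is only a sketch: the two formats are assumed from the sample data, and parse_date and CANDIDATE_FORMATS are hypothetical names.

```python
from datetime import datetime

# Assumed formats, taken from the two shapes seen in the sample data.
CANDIDATE_FORMATS = ('%m/%d/%Y', '%Y-%m-%d %H:%M:%S')

def parse_date(text):
    """Try each known format in turn; return a datetime, or None if none fit."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

rows = ['12/25/2020', '10/25/2020', '2020-09-12 00:00:00', '2020-12-09 00:00:00']
parsed = [parse_date(r) for r in rows]
print([d.month for d in parsed])  # [12, 10, 9, 12]
```

On the dataframe this could be applied as df['Data'].astype(str).apply(parse_date), after which the month is available from each datetime object's month attribute. Unrecognized strings come back as None, which makes bad input easy to spot.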