I have the following dataframe:
Data
0 12/25/2020
1 10/25/2020
2 2020-09-12 00:00:00
3 2020-12-09 00:00:00
I'm using the following (python) code to extract the first two potential numbers to represent a month:
df['Data'].apply(lambda x: re.match('.*([1-2][0-9]{3})', x).group(1))
However, it returns a DataFrame full of NaN values. When I test it in regex101, it works (link: https://regex101.com/r/QpacQ0/1). So, I have two questions:
- Is there a better way to work with dates from user input? I mean, I'm building a script that recognizes the date parts by position and then converts them to a datetime object.
- And second, why can't this code recognize the months?
CodePudding user response:
You need to use
df['Month'] = df['Data'].str.extract(r'\b(0[1-9]|1[0-2])\b')
When using re.match('.*([1-2][0-9]{3})', x), you actually match any zero or more chars other than line break chars, as many as possible, from the start of the string (since re.match only searches for a match at the start of the string), and then capture a 1 or 2 followed by any three digits. So, you actually match the last occurrence of a specific 4-digit sequence, not a month-like number.
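As a quick check of that behaviour, here is a minimal sketch that runs the pattern from the question against the four sample values from the question's dataframe. The names pattern, samples and captured are only illustrative.

```python
import re

# Pattern from the question: greedy ".*" followed by a 4-digit group
# that starts with 1 or 2.
pattern = r'.*([1-2][0-9]{3})'

# Sample values copied from the question's dataframe.
samples = [
    '12/25/2020',
    '10/25/2020',
    '2020-09-12 00:00:00',
    '2020-12-09 00:00:00',
]

# re.match anchors at the start of the string; the greedy ".*" then
# backtracks until the group can match, so the last suitable 4-digit
# run is captured.
captured = [re.match(pattern, s).group(1) for s in samples]
print(captured)  # ['2020', '2020', '2020', '2020']
```

In every row the captured group is the year, which is why no month is ever returned.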
With .str.extract(r'\b(0[1-9]|1[0-2])\b'), you extract the first occurrence of 0 followed by a non-zero digit, or 1 followed by 0, 1 or 2, as a whole word, due to the \b word boundaries.
Here is the regex demo.
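A minimal sketch of the extraction on the question's sample data follows. It assumes pandas is installed; expand=False is added here only so that the result is a Series rather than a one-column DataFrame, and the variable months is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Data': [
        '12/25/2020',
        '10/25/2020',
        '2020-09-12 00:00:00',
        '2020-12-09 00:00:00',
    ]
})

# Extract the first standalone two-digit number in the range 01..12.
# expand=False returns a Series because the pattern has one group.
df['Month'] = df['Data'].str.extract(r'\b(0[1-9]|1[0-2])\b', expand=False)

months = df['Month'].tolist()
print(months)  # ['12', '10', '09', '12']
```

Note that this is a positional heuristic: it returns the first month-like number as a string, so in a day-first value such as 05/12/2020 it would pick 05.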
If the Data column is not a string column, convert it into one:
df['Month'] = df['Data'].astype(str).str.extract(r'\b(0[1-9]|1[0-2])\b')
# ^^^^^^^^^^^^
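On the first question (a better way to handle dates from user input), one possible alternative to positional regex matching is to let pandas parse each value and then read the month from the resulting datetime. This is only a sketch under the assumption that every value is a date string pandas can recognise on its own; the column names Parsed and Month and the variable month_numbers are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Data': [
        '12/25/2020',
        '10/25/2020',
        '2020-09-12 00:00:00',
        '2020-12-09 00:00:00',
    ]
})

# Parse each value on its own so that rows in different formats do not
# have to share a single inferred format.
df['Parsed'] = df['Data'].astype(str).apply(pd.to_datetime)

# Read the month as an integer from the parsed timestamps.
df['Month'] = df['Parsed'].dt.month

month_numbers = df['Month'].tolist()
print(month_numbers)  # [12, 10, 9, 12]
```

Ambiguous inputs such as 05/12/2020 are read month-first by default, so day-first data would need the dayfirst=True argument of pd.to_datetime or an explicit format.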