I want to find which columns of a dataframe contain date values and convert those columns to datetime, because the column type may initially be object. The dates can be in any of the formats below, so I am looking for a regex pattern that matches all of these date formats.
- 04/10/2022
- 10/04/2022
- 2022/04/10
- 2022/10/04
- 2022-12-20 00:00:00
- 04-10-2022
Can someone please suggest a regex pattern which will match all date formats?
I have tried the code below:
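import re
import pandas as pd

for columnIndex, colName in enumerate(df):
    df2 = pd.DataFrame()
    df2['test'] = df[colName]
    count = 0
    # items() replaces iteritems(), which was removed in pandas 2.0
    for i, j in df2.items():
        for k in j:
            # Only matches NN/NN/NNNN; the other formats are not counted
            if re.match("[0-9]{2}/[0-9]{2}/[0-9]{4}", str(k)):
                count = count + 1
    # Convert the column once more than 5 values look like dates
    if count > 5:
        df[colName] = pd.to_datetime(df[colName])
print(df.dtypes)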
CodePudding user response:
Consider the following dataframe df, with all the date formats indicated by the OP in the question:
df = pd.DataFrame({'date': ['04/10/2022', '10/04/2022', '2022/04/10', '2022/10/04', '2022-12-20 00:00:00', '04-10-2022']})
[Out]:
                  date
0           04/10/2022
1           10/04/2022
2           2022/04/10
3           2022/10/04
4  2022-12-20 00:00:00
5           04-10-2022
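Assuming the goal is to convert to datetime, one can use pandas.to_datetime. It has the parameter infer_datetime_format, which can be used as follows (note that this parameter is deprecated since pandas 2.0, where mixed-format input needs format='mixed' instead):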
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
[Out]:
        date
0 2022-04-10
1 2022-10-04
2 2022-04-10
3 2022-10-04
4 2022-12-20
5 2022-04-10
For this case, it does the job. Note that ambiguous values such as 04/10/2022 are read month-first (April 10), as the output shows.
Note:
- To see how the function is implemented, check the pandas source code on GitHub.
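On pandas 2.0 or later, a similar result can be sketched with format='mixed', which infers the format of each element separately. This is a minimal sketch that assumes pandas >= 2.0 is installed; dayfirst=False is spelled out only to make the month-first reading of ambiguous values explicit.

```python
import pandas as pd

df = pd.DataFrame({'date': ['04/10/2022', '10/04/2022', '2022/04/10',
                            '2022/10/04', '2022-12-20 00:00:00', '04-10-2022']})

# Assumption: pandas >= 2.0, where format='mixed' is available.
# Each element is parsed on its own; ambiguous values are read month-first.
df['date'] = pd.to_datetime(df['date'], format='mixed', dayfirst=False)
print(df.dtypes)
```

If the data is known to be day-first, passing dayfirst=True changes how values such as 04/10/2022 are read, so it is worth checking a few rows by hand.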
CodePudding user response:
Why not simply use pandas.to_datetime without providing any format?
for col in df.columns:
    df[col] = pd.to_datetime(df[col])
# Output :
print(df)
        Col1       Col2       Col3       Col4
0 2022-04-10        NaT        NaT        NaT
1        NaT 2022-10-04        NaT        NaT
2        NaT        NaT 2022-04-10        NaT
3 2022-10-04        NaT        NaT        NaT
4        NaT        NaT        NaT 2022-12-20
5 2022-04-10        NaT        NaT        NaT
# Input used :
         Col1        Col2        Col3                 Col4
0  04/10/2022         NaN         NaN                  NaN
1         NaN  10/04/2022         NaN                  NaN
2         NaN         NaN  2022/04/10                  NaN
3  2022/10/04         NaN         NaN                  NaN
4         NaN         NaN         NaN  2022-12-20 00:00:00
5  04-10-2022         NaN         NaN                  NaN
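Calling pd.to_datetime on every column can fail or give surprising results when a column holds non-date text or plain numbers, and pandas 2.0 and later expect one consistent format per column unless told otherwise. A possible sketch for converting only the columns that look like dates is below. The helper name convert_date_columns and the min_share threshold are hypothetical, and parsing element by element is a simple but slow choice made here so that mixed formats within one column are tolerated.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def convert_date_columns(df, min_share=0.8):
    """Return a copy of df where text columns that mostly parse as dates
    are converted to datetime. min_share is a hypothetical threshold."""
    out = df.copy()
    for col in out.columns:
        # Only look at text-like columns; numeric columns are left alone.
        if not (is_object_dtype(out[col]) or is_string_dtype(out[col])):
            continue
        values = out[col].dropna()
        if values.empty:
            continue
        # Parse each value on its own; unparseable values become NaT.
        parsed = values.apply(lambda v: pd.to_datetime(v, errors="coerce"))
        if parsed.notna().mean() >= min_share:
            full = out[col].apply(lambda v: pd.to_datetime(v, errors="coerce"))
            out[col] = pd.to_datetime(full)
    return out


example = pd.DataFrame({
    "name": ["foo", "bar", "baz"],
    "when": ["04/10/2022", "2022/10/04", "2022-12-20 00:00:00"],
    "n": [1, 2, 3],
})
converted = convert_date_columns(example)
print(converted.dtypes)
```

The threshold plays the role of the count > 5 check in the question, expressed as a share of non-null values so that it does not depend on the column length.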
CodePudding user response:
Here is an idea. This code matches all the formats; however, you can't distinguish between day and month if the date is, say, 05/05/2022. But that is an issue that goes beyond the scope of the question.
The regexp I came up with looks for three groups of one or more digits, [0-9]+, separated by either a dash or a slash, [/-]; I used the backslash to escape the special symbols.
import re

dates = """04/10/2022
10/04/2022
2022/04/10
2022/10/04
2022-12-20 00:00:00
04-10-2022"""

# Three groups of digits separated by '/' or '-'
dre = re.compile(r"([0-9]+)[\/\-]([0-9]+)[\/\-]([0-9]+)")
for date in dates.split("\n"):
    m = dre.match(date)
    print(m.group(1), m.group(2), m.group(3))
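To apply a regex like this at the column level, one option is to measure what share of a column's values match. The sketch below uses a stricter, hypothetical pattern (DATE_RE) that accepts year-first or year-last dates, requires the same separator twice, and allows an optional time part. The helper name share_of_dates is also an assumption. It checks the shape of the string only, so a value like 99/99/2022 still matches.

```python
import re
import pandas as pd

# Hypothetical pattern: YYYY-sep-M-sep-D or D-sep-M-sep-YYYY, where sep is
# '/' or '-' (the backreference forces the same separator both times),
# followed by an optional HH:MM or HH:MM:SS time.
DATE_RE = re.compile(
    r"^(?:\d{4}(?P<sep1>[/-])\d{1,2}(?P=sep1)\d{1,2}"
    r"|\d{1,2}(?P<sep2>[/-])\d{1,2}(?P=sep2)\d{4})"
    r"(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?$"
)


def share_of_dates(series):
    """Fraction of non-null values whose text has a date-like shape."""
    values = series.dropna().astype(str)
    if values.empty:
        return 0.0
    return float(values.map(lambda v: DATE_RE.match(v) is not None).mean())


dates = pd.Series(['04/10/2022', '10/04/2022', '2022/04/10',
                   '2022/10/04', '2022-12-20 00:00:00', '04-10-2022'])
print(share_of_dates(dates))
```

A column could then be passed to pd.to_datetime only when this share is above a chosen threshold, which keeps the regex as a cheap filter and leaves the actual parsing to pandas.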