I have an extremely large dataset whose date/time columns come in various formats. I have a validation function that detects the possible time string formats and handles both 24-hour and 12-hour time. The separator is always a colon (:). A sample is below. However, after profiling my code, it seems this function can become a bottleneck and is expensive in terms of execution time. Is there a better way to do this that doesn't hurt performance?
import datetime

def validate_time(time_str: str):
    for time_format in ["%H:%M", "%H:%M:%S", "%H:%M:%S.%f", "%I:%M %p"]:
        try:
            return datetime.datetime.strptime(time_str, time_format)
        except ValueError:
            continue
    return None

print(validate_time(time_str="9:21 PM"))
CodePudding user response:
Instead of trying to parse using every format string, you could split by colons to obtain the segments of your string that denote hours, minutes, and everything that remains. Then you can parse the result depending on the number of values the split returns:
def validate_time_new(time_str: str):
    time_vals = time_str.split(':')
    try:
        if len(time_vals) == 1:
            # No split, so invalid time
            return None
        elif len(time_vals) == 2:
            # [-2:] takes the last two characters of the final segment
            if time_vals[-1][-2:].lower() in ["am", "pm"]:
                # if last element ends with am or pm, try to parse as 12hr time
                return datetime.datetime.strptime(time_str, "%I:%M %p")
            else:
                # try to parse as 24h time
                return datetime.datetime.strptime(time_str, "%H:%M")
        elif len(time_vals) == 3:
            if "." in time_vals[-1]:
                # If the last element has a decimal point, try to parse microseconds
                return datetime.datetime.strptime(time_str, "%H:%M:%S.%f")
            else:
                # try to parse without microseconds
                return datetime.datetime.strptime(time_str, "%H:%M:%S")
        else:
            return None
    except ValueError:
        # If any of the attempts to parse throws an error, return None
        return None
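Before comparing speed, it may be worth checking that picking the format up front gives the same results as trying every format. The following is a minimal, self-contained sketch of that check. The names parse_by_trial and parse_by_dispatch are hypothetical stand-ins for the two approaches, not the functions above, and the sample strings are just the ones used in this answer.

```python
import datetime

FORMATS = ["%H:%M", "%H:%M:%S", "%H:%M:%S.%f", "%I:%M %p"]

def parse_by_trial(time_str):
    """Baseline: try every format in turn, as in the question."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(time_str, fmt)
        except ValueError:
            continue
    return None

def parse_by_dispatch(time_str):
    """Pick one format from the shape of the string, then parse once."""
    parts = time_str.split(":")
    if len(parts) == 2:
        # The last two characters decide between 12-hour and 24-hour time.
        fmt = "%I:%M %p" if parts[-1][-2:].lower() in ("am", "pm") else "%H:%M"
    elif len(parts) == 3:
        # A decimal point in the seconds segment means fractional seconds.
        fmt = "%H:%M:%S.%f" if "." in parts[-1] else "%H:%M:%S"
    else:
        # No colon at all, or too many segments: not one of the known formats.
        return None
    try:
        return datetime.datetime.strptime(time_str, fmt)
    except ValueError:
        return None

samples = ["12:24", "12:23:42", "13:53", "1:53 PM",
           "12:24:43.220", "not a date", "54:23:21"]
mismatches = [s for s in samples if parse_by_trial(s) != parse_by_dispatch(s)]
print(mismatches)  # an empty list means both parsers agree on these samples
```

Agreement on a handful of strings is not proof of equivalence, so it is probably sensible to run the same comparison on a sample of your real data.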
To test this, let's time both methods for a bunch of test strings:
import timeit

print("old\t\t\tnew\t\t\t\told/new\t\ttest_string")
for s in ["12:24", "12:23:42", "13:53", "1:53 PM", "12:24:43.220", "not a date", "54:23:21"]:
    t1 = timeit.timeit('validate_time(s)', 'from __main__ import datetime, validate_time, s', number=100)
    t2 = timeit.timeit('validate_time_new(s)', 'from __main__ import datetime, validate_time_new, s', number=100)
    print(f"{t1:.6f}\t{t2:.6f}\t\t{t1/t2:.6f}\t\t{s}")
old         new         old/new      test_string
0.001628    0.001143     1.424322    12:24
0.001567    0.001012     1.548661    12:23:42
0.000935    0.000979     0.955177    13:53
0.003004    0.000722     4.161657    1:53 PM
0.004523    0.001396     3.241204    12:24:43.220
0.002148    0.000025    84.897370    not a date
0.002262    0.000622     3.638629    54:23:21