I have a file with thousands of timestamps. Some are in the standard format, while others are followed by a comma and three digits, like this:
Standard format: 00:00:44
Followed by comma and three digits: 00:00:46,235
I've removed the standard formats using the following regex:
text = re.sub(r'^((?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$)', '', text)
That works. But for the times followed by a comma and three digits, nothing I've tried so far removes them. How can I remove this odd time format?
CodePudding user response:
Your regex matches the standard time format.
r'^((?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$)'
Just add the comma part at the end, and make it optional.
r'^((?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d(?:,\d{3})?$)'
Explanation for (?:,\d{3})?:
(?: ) Non-capturing group
,\d{3} Comma, then three digits
? Match zero or one times
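As a minimal sketch of how the extended pattern could be applied, the snippet below assumes the whole file has been read into one string with one timestamp per line, which the question does not state. Under that assumption `re.MULTILINE` is needed so that `^` and `$` anchor at every line rather than only at the start and end of the string. The names `TIME_RE` and `strip_timestamps` are hypothetical.

```python
import re

# Original time pattern plus the optional ",ddd" suffix.
# re.MULTILINE (an assumption about the input) lets ^ and $ match per line.
TIME_RE = re.compile(
    r'^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d(?:,\d{3})?$',
    re.MULTILINE,
)

def strip_timestamps(text: str) -> str:
    """Blank out every line that consists solely of a timestamp."""
    return TIME_RE.sub('', text)

sample = "00:00:44\nsome text\n00:00:46,235\nmore text"
print(repr(strip_timestamps(sample)))  # '\nsome text\n\nmore text'
```

The matched lines are emptied but their newlines remain, which mirrors what the original `re.sub(..., '', text)` call does.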
CodePudding user response:
The quick and dirty way is to use split():
text = text.split(",")[0]
text = re.sub(r'^((?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$)', '', text)
You can also update your regex to add an optional part at the end.
text = re.sub(r'^((?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d),?\d{0,3}$', '', text)
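One thing worth checking with the `,?\d{0,3}` ending is that the comma and the digits are optional independently of each other, so it accepts more than the two formats in the question. The sketch below compares it with a stricter `(?:,\d{3})?` ending on a few made-up test strings; the variable names are hypothetical.

```python
import re

# Strict: the comma and exactly three digits appear together or not at all.
strict = re.compile(r'^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d(?:,\d{3})?$')
# Loose: the comma is optional and zero to three digits may follow.
loose = re.compile(r'^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d,?\d{0,3}$')

candidates = ["00:00:44", "00:00:46,235", "00:00:46,2", "00:00:46235"]

# Map each candidate to (matches strict, matches loose).
results = {s: (bool(strict.match(s)), bool(loose.match(s))) for s in candidates}

for s, (is_strict, is_loose) in results.items():
    print(f"{s!r}: strict={is_strict} loose={is_loose}")
```

Both patterns accept the two formats from the question; only the loose one also accepts `00:00:46,2` and `00:00:46235`. Whether that matters depends on what else appears in the file.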
CodePudding user response:
Using re.sub with a capture group, which keeps the time and drops only the comma and three digits:
import re

inp = "Followed by comma and three digits: 00:00:46,235"
# \1 re-inserts the captured hh:mm:ss part; the ",235" suffix is dropped.
output = re.sub(r'\b(\d{2}:\d{2}:\d{2}),\d{3}', r'\1', inp)
print(output) # Followed by comma and three digits: 00:00:46