I have a pandas
dataframe with a string column representing a date/time. The date
column can have various different timezone endings such as BST, UTC, 01 etc. I don't know a priori which endings each date may have. I've only shown three for illustration purposes.
data = {'Name': ['Bob', 'Mary','Alice'],
'Age': [21, 19, 20],
'Date': ['2022-07-07 16:43 ( 01)', '2022-07-07 16:43 (UTC)',
'2022-07-07 16:43 (BST)']}
# Convert the dictionary into DataFrame
data = pd.DataFrame(data, columns=['Name', 'Age',
'Date'])
Dataframe:
Name Age Date
Bob 21 2022-07-07 16:43 ( 01)
Mary 19 2022-07-07 16:43 (UTC)
Alice 20 2022-07-07 16:43 (BST)
Does a general method exist to remove the various suffixes from the date and convert it to a standard timezone (let's say UTC)?
CodePudding user response:
You can use dateutil.parser.parse
with fuzzy=True
right out of the box or with a minor modification.
Fuzzy:
>>> import dateutil.parser
>>> time = '2022-07-07 16:43 ( 01)'
>>> dateutil.parser.parse(time, fuzzy=True)
...: datetime.datetime(2022, 7, 7, 16, 43, tzinfo=tzoffset(None, 3600))
Remove parentheses then call
>>> import dateutil.parser
>>> time = '2022-07-07 16:43 ( 01)'
>>> dateutil.parser.parse(time.replace(')', '').replace('(', ''))
...: datetime.datetime(2022, 7, 7, 16, 43, tzinfo=tzoffset(None, 3600))
A more efficient function to parse for a big dataset could be like:
def parse_datetime(time: str) -> datetime.datetime:
parsable_time = time.translate({ord('('): None,
ord(')'): None})
return dateutil.parser.parse(parsable_time)
The rational is you parse(..., fuzzy=True)
is quiet general and can parse things like "Today is January 1, 2047 at 8:21:00AM" properly, in fact the sentence is from the docstring of parse(..., fuzzy=True)
, it does quiet a bit of book keeping to carry out such amazing task, you don't need that, so getting rid of parentheses could be more effective if you are running it on a sufficiently big dataset in pandas.
CodePudding user response:
Here's a way how you could deal with that input in pandas, also making use of dateutil's parser, but setting the tzinfos keyword.
import pandas as pd
import dateutil
df = pd.DataFrame(
{'Name': ['Bob', 'Mary','Alice'],
'Age': [21, 19, 20],
'Date': ['2022-07-07 16:43 ( 01)', '2022-07-07 16:43 (UTC)', '2022-07-07 16:43 (BST)']},
)
# cleanup
df["Date"] = df["Date"].str.replace("\((. )\)", "\\1", regex=True)
# first go: parse with "standard" parser, inferring format
df["datetime"] = df["Date"].apply(dateutil.parser.parse)
# now we still have "unparsed" tz:
df["datetime"]
0 2022-07-07 16:43:00 01:00 # was 01 => ok
1 2022-07-07 16:43:00 00:00 # was UTC => ok
2 2022-07-07 16:43:00 # bad...
Name: Date, dtype: object
# second go...
# where there is no tz set:
m = df["datetime"].apply(lambda t: t.tzinfo is None)
# define time zones
tzinfos = {'BST': dateutil.tz.gettz("Europe/London")}
# ...so we can parser ambiguous stuff like BST
df.loc[m, "datetime"] = df["Date"][m].apply(dateutil.parser.parse, tzinfos=tzinfos)
df["datetime"]
0 2022-07-07 16:43:00 01:00
1 2022-07-07 16:43:00 00:00
2 2022-07-07 16:43:00 01:00
Name: datetime, dtype: object # notice the data type of the series
# now to UTC:
df["UTC"] = df["datetime"].apply(lambda t: t.astimezone(dateutil.tz.UTC))
df["UTC"]
0 2022-07-07 15:43:00 00:00
1 2022-07-07 16:43:00 00:00
2 2022-07-07 15:43:00 00:00
Name: UTC, dtype: datetime64[ns, tzutc()] # datetime64 dtype now :)
Notes
- you need all the
apply
s because you're dealing with mixed UTC offsets in one Series. pandas'dt
accessor can only handle a single UTC offset / time zone in one Series - you can skip the first go and the masking step in the second go if you also define
01
for example in the tzinfos mapping