Strip text from timestamp and convert to consistent datetime-CodePudding

I have a pandas dataframe with a string column representing a date/time. The date column can have various different timezone endings such as BST, UTC, 01 etc. I don't know a priori which endings each date may have. I've only shown three for illustration purposes.

data = {'Name': ['Bob', 'Mary','Alice'],
        'Age': [21, 19, 20],
        'Date': ['2022-07-07 16:43 ( 01)', '2022-07-07 16:43 (UTC)',
                   '2022-07-07 16:43 (BST)']}
  
# Convert the dictionary into DataFrame
data = pd.DataFrame(data, columns=['Name', 'Age', 
                                 'Date'])

Dataframe:

Name   Age  Date
Bob    21   2022-07-07 16:43 ( 01)
Mary   19   2022-07-07 16:43 (UTC)
Alice  20   2022-07-07 16:43 (BST)

Does a general method exist to remove the various suffixes from the date and convert it to a standard timezone (let's say UTC)?

CodePudding user response：

You can use dateutil.parser.parse with fuzzy=True right out of the box or with a minor modification.

Fuzzy:

>>> import dateutil.parser
>>> time = '2022-07-07 16:43 ( 01)'
>>> dateutil.parser.parse(time, fuzzy=True)
...: datetime.datetime(2022, 7, 7, 16, 43, tzinfo=tzoffset(None, 3600))

Remove parentheses then call

>>> import dateutil.parser
>>> time = '2022-07-07 16:43 ( 01)'
>>> dateutil.parser.parse(time.replace(')', '').replace('(', ''))
...: datetime.datetime(2022, 7, 7, 16, 43, tzinfo=tzoffset(None, 3600))

A more efficient function to parse for a big dataset could be like:

def parse_datetime(time: str) -> datetime.datetime:
    parsable_time = time.translate({ord('('): None,
                        ord(')'): None})
    return dateutil.parser.parse(parsable_time)

The rational is you parse(..., fuzzy=True) is quiet general and can parse things like "Today is January 1, 2047 at 8:21:00AM" properly, in fact the sentence is from the docstring of parse(..., fuzzy=True), it does quiet a bit of book keeping to carry out such amazing task, you don't need that, so getting rid of parentheses could be more effective if you are running it on a sufficiently big dataset in pandas.

CodePudding user response：

Here's a way how you could deal with that input in pandas, also making use of dateutil's parser, but setting the tzinfos keyword.

import pandas as pd
import dateutil

df = pd.DataFrame(
    {'Name': ['Bob', 'Mary','Alice'],
     'Age': [21, 19, 20],
     'Date': ['2022-07-07 16:43 ( 01)', '2022-07-07 16:43 (UTC)', '2022-07-07 16:43 (BST)']},
     )

# cleanup
df["Date"] = df["Date"].str.replace("\((. )\)", "\\1", regex=True)

# first go: parse with "standard" parser, inferring format
df["datetime"] = df["Date"].apply(dateutil.parser.parse)

# now we still have "unparsed" tz:
df["datetime"]
0    2022-07-07 16:43:00 01:00 # was  01 => ok
1    2022-07-07 16:43:00 00:00 # was UTC => ok
2          2022-07-07 16:43:00 # bad...
Name: Date, dtype: object

# second go...
# where there is no tz set:
m = df["datetime"].apply(lambda t: t.tzinfo is None)
# define time zones
tzinfos = {'BST': dateutil.tz.gettz("Europe/London")}
# ...so we can parser ambiguous stuff like BST
df.loc[m, "datetime"] = df["Date"][m].apply(dateutil.parser.parse, tzinfos=tzinfos)

df["datetime"]
0    2022-07-07 16:43:00 01:00
1    2022-07-07 16:43:00 00:00
2    2022-07-07 16:43:00 01:00
Name: datetime, dtype: object # notice the data type of the series

# now to UTC:
df["UTC"] = df["datetime"].apply(lambda t: t.astimezone(dateutil.tz.UTC))

df["UTC"]
0   2022-07-07 15:43:00 00:00
1   2022-07-07 16:43:00 00:00
2   2022-07-07 15:43:00 00:00
Name: UTC, dtype: datetime64[ns, tzutc()] # datetime64 dtype now :)

Notes

you need all the applys because you're dealing with mixed UTC offsets in one Series. pandas' dt accessor can only handle a single UTC offset / time zone in one Series
you can skip the first go and the masking step in the second go if you also define 01 for example in the tzinfos mapping