Different behavior of apply(str) and astype(str) for datetime64[ns] pandas columns

Time:01-03

I'm working with datetime information in pandas and wanted to convert a bunch of datetime64[ns] columns to str. I noticed that two approaches I expected to yield the same result behave differently.

Here's an MCVE.

import pandas as pd

# Create a dataframe with dates according to ISO8601
df = pd.DataFrame(
    {
        "dt_column": [
            "2023-01-01",
            "2023-01-02",
            "2023-01-02",
        ]
    }
)

# Convert the dates to datetime columns
# (I expect the time portion to be 00:00:00)
df["dt_column"] = pd.to_datetime(df["dt_column"])

df["str_from_astype"] = df["dt_column"].astype(str)
df["str_from_apply"] = df["dt_column"].apply(str)

print(df)
print("")
print(f"Datatypes of the dataframe \n{df.dtypes}")

Output

   dt_column str_from_astype       str_from_apply
0 2023-01-01      2023-01-01  2023-01-01 00:00:00
1 2023-01-02      2023-01-02  2023-01-02 00:00:00
2 2023-01-02      2023-01-02  2023-01-02 00:00:00

Datatypes of the dataframe 
dt_column          datetime64[ns]
str_from_astype            object
str_from_apply             object
dtype: object

With .astype(str) the time information is dropped, whereas with .apply(str) it is retained (or inferred).

Why is that?

(Pandas v1.5.2, Python 3.9.15)

Answer:

The time information is never lost. If you change one of the values to 2023-01-02 12:00, you'll see that astype(str) shows the time for every row, and so does the original datetime column:

            dt_column      str_from_astype       str_from_apply
0 2023-01-01 00:00:00  2023-01-01 00:00:00  2023-01-01 00:00:00
1 2023-01-02 00:00:00  2023-01-02 00:00:00  2023-01-02 00:00:00
2 2023-01-02 12:00:00  2023-01-02 12:00:00  2023-01-02 12:00:00

With apply, Python's built-in str is called on each Timestamp object individually, and that always produces the full format:

str(pd.Timestamp('2023-01-01'))
# '2023-01-01 00:00:00'

With astype, the formatting is handled by pandas.io.formats.format.SeriesFormatter, which is a bit smarter and chooses the output format based on context (here, the other values in the Series: the time is omitted only when no value has a non-midnight time component).

In any case, the canonical way to be explicit is to use dt.strftime:
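To see the context dependence in isolation, here is a minimal sketch comparing two Series. The names midnight_only and with_time are made up for illustration, and the results in the comments are the ones implied by the tables above (pandas 1.5.2); other versions may format differently.

```python
import pandas as pd

# Every value sits exactly at midnight.
midnight_only = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"]))

# One value carries a non-midnight time.
with_time = pd.Series(
    [pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02 12:00")]
)

# astype(str) picks one format for the whole Series.
astype_midnight = midnight_only.astype(str).tolist()
astype_with_time = with_time.astype(str).tolist()

# apply(str) formats each Timestamp on its own, without looking at the others.
apply_midnight = midnight_only.apply(str).tolist()

print(astype_midnight)   # date only
print(astype_with_time)  # date and time for every row
print(apply_midnight)    # date and time, even though all values are at midnight
```

So the difference is in how the values are rendered, not in the stored data: both Series keep their full timestamps either way.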

# without time
df["dt_column"].dt.strftime('%Y-%m-%d')

# with time
df["dt_column"].dt.strftime('%Y-%m-%d %H:%M:%S')
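Since the question is about converting several datetime columns at once, the same idea can be sketched as a loop over all datetime columns. The frame and its column names (created, updated, label) are hypothetical, and selecting columns with select_dtypes is one possible approach, not something taken from the original answer.

```python
import pandas as pd

# Hypothetical frame: two datetime columns and one non-datetime column.
df = pd.DataFrame(
    {
        "created": pd.to_datetime(["2023-01-01", "2023-01-02"]),
        "updated": [
            pd.Timestamp("2023-01-01 08:30:00"),
            pd.Timestamp("2023-01-02 12:00:00"),
        ],
        "label": ["a", "b"],
    }
)

# Pick out the timezone-naive datetime columns.
datetime_cols = df.select_dtypes(include="datetime").columns

# Use one explicit format so the result does not depend on the column's contents.
for col in datetime_cols:
    df[col] = df[col].dt.strftime("%Y-%m-%d %H:%M:%S")

print(df)
```

With an explicit format string, a column holding only midnight values still gets its 00:00:00 suffix, and the non-datetime column is left untouched.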