I'm working with datetime information in pandas and wanted to convert a bunch of datetime64[ns] columns to str. I noticed different behavior from two approaches that I expected to yield the same result.
Here's an MCVE.
import pandas as pd

# Create a dataframe with dates according to ISO8601
df = pd.DataFrame(
    {
        "dt_column": [
            "2023-01-01",
            "2023-01-02",
            "2023-01-02",
        ]
    }
)

# Convert the dates to datetime columns
# (I expect the time portion to be 00:00:00)
df["dt_column"] = pd.to_datetime(df["dt_column"])
df["str_from_astype"] = df["dt_column"].astype(str)
df["str_from_apply"] = df["dt_column"].apply(str)

print(df)
print("")
print(f"Datatypes of the dataframe \n{df.dtypes}")
Output:
   dt_column str_from_astype       str_from_apply
0 2023-01-01      2023-01-01  2023-01-01 00:00:00
1 2023-01-02      2023-01-02  2023-01-02 00:00:00
2 2023-01-02      2023-01-02  2023-01-02 00:00:00

Datatypes of the dataframe
dt_column          datetime64[ns]
str_from_astype            object
str_from_apply             object
dtype: object
If I use .astype(str) the time information is lost, and when I use .apply(str) the time information is retained (or inferred).
Why is that?
(Pandas v1.5.2, Python 3.9.15)
CodePudding user response:
The time information is never lost. If you change one of the values to 2023-01-02 12:00, you'll see that the times are present in every row with astype, and that they also become visible in the original datetime column:
            dt_column      str_from_astype       str_from_apply
0 2023-01-01 00:00:00  2023-01-01 00:00:00  2023-01-01 00:00:00
1 2023-01-02 00:00:00  2023-01-02 00:00:00  2023-01-02 00:00:00
2 2023-01-02 12:00:00  2023-01-02 12:00:00  2023-01-02 12:00:00
With apply, the Python str builtin is applied to each Timestamp object, which always shows the full format:
str(pd.Timestamp('2023-01-01'))
# '2023-01-01 00:00:00'
With astype, the formatting is handled by pandas.io.formats.format.SeriesFormatter, which is a bit smarter and decides on the output format depending on the context (here, the other values in the Series and whether any of them has a time other than midnight).
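To make the context dependence concrete, here is a minimal sketch comparing the two conversions on a midnight-only Series and on one that contains a noon value. The variable names are only illustrative, and the expected values in the comments assume the behavior described above (observed with pandas 1.5.2).

```python
import pandas as pd

# Two small Series: one with only midnight values, one with a noon value.
midnight_only = pd.Series([pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02")])
with_noon = pd.Series([pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02 12:00")])

# astype(str) picks one format for the whole Series.
astype_midnight = midnight_only.astype(str).tolist()
astype_noon = with_noon.astype(str).tolist()

# apply(str) calls str() on each Timestamp independently.
apply_midnight = midnight_only.apply(str).tolist()

print(astype_midnight)  # ['2023-01-01', '2023-01-02']
print(astype_noon)      # ['2023-01-01 00:00:00', '2023-01-02 12:00:00']
print(apply_midnight)   # ['2023-01-01 00:00:00', '2023-01-02 00:00:00']
```

A single non-midnight value is enough to switch every row of the astype result to the full format, while apply(str) never looks at neighbouring values.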
In any case, the canonical way to be explicit is to use dt.strftime:
# without time
df["dt_column"].dt.strftime('%Y-%m-%d')
# with time
df["dt_column"].dt.strftime('%Y-%m-%d %H:%M:%S')
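Since the question is about converting a bunch of datetime64[ns] columns at once, the same idea could be wrapped in a small helper. This is only a sketch: datetimes_to_str is a hypothetical name, not a pandas function, and it assumes timezone-naive datetime columns.

```python
import pandas as pd


def datetimes_to_str(frame, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a copy of frame with every datetime64 column formatted via strftime.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = frame.copy()
    # select_dtypes picks only the (timezone-naive) datetime columns.
    for col in out.select_dtypes(include="datetime64").columns:
        out[col] = out[col].dt.strftime(fmt)
    return out


df = pd.DataFrame(
    {
        "dt_column": pd.to_datetime(["2023-01-01", "2023-01-02"]),
        "n": [1, 2],
    }
)

converted = datetimes_to_str(df, fmt="%Y-%m-%d")
print(converted["dt_column"].tolist())  # ['2023-01-01', '2023-01-02']
```

Passing the format explicitly keeps the output independent of what other values happen to be in the column, and the original dataframe is left unchanged.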