I am unsure if this is the expected behavior, but here is an example DataFrame:
import pandas as pd

df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])
Before I upgraded to pandas version 1.3.4, I believe I was able to convert column dtypes like this:
df.iloc[:,0:2] = df.iloc[:,0:2].apply(pd.to_datetime)
Although it appears to have converted the columns to datetime,
start_date end_date field1
0 2020-01-01 00:00:00 2020-06-30 00:00:00 A
1 2020-07-01 00:00:00 2020-12-31 00:00:00 B
The dtypes appear to still be objects:
start_date object
end_date object
field1 object
I know I am able to do the same thing using the code below; I am just wondering if this is the intended behavior of both loc and iloc.
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
start_date datetime64[ns]
end_date datetime64[ns]
field1 object
CodePudding user response:
This behaviour is part of the changes introduced in 1.3.0.
Try operating inplace when setting values with loc and iloc:

When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.

Meaning that iloc and loc will try not to change the dtype of an array if the new values can fit in the existing dtype:
import pandas as pd
df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df.loc[:, 'A':'B'] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)
Output:
A B
0 1.2 3.4
1 2.3 4.5
A float64
B float64
dtype: object
A B
0 1.0 3.0
1 2.0 4.0
A float64
B float64
dtype: object
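To connect this back to the question: with the DataFrame from the question, the iloc assignment stores the converted Timestamp values into the existing object-dtype arrays, so the columns keep their object dtype. A minimal reproduction:

```python
import pandas as pd

df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])

# iloc writes the Timestamp values into the existing object-dtype
# arrays instead of replacing them, so the dtypes do not change
df.iloc[:, 0:2] = df.iloc[:, 0:2].apply(pd.to_datetime)
print(df.dtypes)  # start_date and end_date are still object
```

The values themselves are real Timestamps; only the column dtype is left as object because a Timestamp "fits" in an object array.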
Conversely:

Never operate inplace when setting frame[keys] = values:

When setting multiple columns using frame[keys] = values, new arrays will replace pre-existing arrays for these keys, which will not be over-written (GH39510). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.
import pandas as pd
df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df[['A', 'B']] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)
Output:
A B
0 1.2 3.4
1 2.3 4.5
A float64
B float64
dtype: object
A B
0 1 3
1 2 4
A int64
B int64
dtype: object
With these changes in mind, we now have to do something like:
import pandas as pd
df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
['2020-07-01', '2020-12-31', 'B']],
columns=['start_date', 'end_date', 'field1'])
cols = df.columns[0:2]
df[cols] = df[cols].apply(pd.to_datetime)
# or
# df[df.columns[0:2]] = df.iloc[:, 0:2].apply(pd.to_datetime)
print(df)
print(df.dtypes)
Output:
start_date end_date field1
0 2020-01-01 2020-06-30 A
1 2020-07-01 2020-12-31 B
start_date datetime64[ns]
end_date datetime64[ns]
field1 object
dtype: object
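An alternative, if you know the target dtype up front, is astype, which also replaces the columns wholesale rather than writing into the existing arrays (a sketch using the same example data):

```python
import pandas as pd

df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])

# astype builds new arrays with the requested dtype, so the
# change sticks, unlike an in-place loc/iloc assignment
df = df.astype({'start_date': 'datetime64[ns]',
                'end_date': 'datetime64[ns]'})
print(df.dtypes)
```

Note astype is stricter than pd.to_datetime: it will not infer mixed or non-standard date formats, so it suits clean ISO-formatted strings like these.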