Home > Software design >  Converting multiple columns to datetime using iloc or loc
Converting multiple columns to datetime using iloc or loc

Time:11-15

I am unsure if this is the expected behavior, but below is an example dataframe.

df = pd.DataFrame([['2020-01-01','2020-06-30','A'],
                    ['2020-07-01','2020-12-31','B']],
                        columns = ['start_date', 'end_date', 'field1'])

Before I upgraded to pandas version 1.3.4, I believe I was able to convert column dtypes like this:

df.iloc[:,0:2] = df.iloc[:,0:2].apply(pd.to_datetime)

Although is appears to have converted the columns to datetime,

            start_date             end_date field1
0  2020-01-01 00:00:00  2020-06-30 00:00:00      A
1  2020-07-01 00:00:00  2020-12-31 00:00:00      B

The dtypes appear to still be objects:

start_date    object
end_date      object
field1        object

I know I am able to do the same thing using the code below, I am just wondering if this is the intended behavior of both loc and iloc.

df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
start_date    datetime64[ns]
end_date      datetime64[ns]
field1                object

CodePudding user response:

This behaviour is part of the changes introduced in 1.3.0.

Try operating inplace when setting values with loc and iloc

When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.

Meaning that iloc and loc will try to not change the dtype of an array if the new array can fit in the existing type:

import pandas as pd

df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df.loc[:, 'A':'B'] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)

Output:

     A    B
0  1.2  3.4
1  2.3  4.5
A    float64
B    float64
dtype: object
     A    B
0  1.0  3.0
1  2.0  4.0
A    float64
B    float64
dtype: object

Conversely: Never operate inplace when setting frame[keys] = values:

When setting multiple columns using frame[keys] = values new arrays will replace pre-existing arrays for these keys, which will not be over-written (GH39510). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.

import pandas as pd

df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df[['A', 'B']] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)

Output:

     A    B
0  1.2  3.4
1  2.3  4.5
A    float64
B    float64
dtype: object
   A  B
0  1  3
1  2  4
A    int64
B    int64
dtype: object

With these changes in mind, we now have to do something like:

import pandas as pd

df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])

cols = df.columns[0:2]
df[cols] = df[cols].apply(pd.to_datetime)
# or
# df[df.columns[0:2]] = df.iloc[:, 0:2].apply(pd.to_datetime)

print(df)
print(df.dtypes)

Output:

  start_date   end_date field1
0 2020-01-01 2020-06-30      A
1 2020-07-01 2020-12-31      B
start_date    datetime64[ns]
end_date      datetime64[ns]
field1                object
dtype: object
  • Related