I have a DataFrame that I build row-by-row (by necessity). My issue is that at the end, all dtypes
are object
. This is not so if the DataFrame is created with all the data at once.
Let me explain what I mean.
import pandas as pd
from IPython.display import display
# Example data
cols = ['P/N','Date','x','y','z']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]
# Canonical way to build DataFrame (if you have all the data ready)
dg = pd.DataFrame({'P/N':PN,'Date':dates,'x':xd,'y':yd,'z':zd})
display(dg)
dg.dtypes
Here's what I get. Note the correct dtypes
:
OK, now I do the same thing row-by-row:
# Build empty DataFrame
cols = ['P/N','Date','x','y','z']
df = pd.DataFrame(columns=cols)
# Add rows in loop
for i in range(3):
new_row = {'P/N':PN[i],'Date':pd.to_datetime(dates[i]),'x':xd[i],'y':yd[i],'z':zd[i]}
# deprecated
#df = df.append(new_row,ignore_index=True)
df = pd.concat([df,pd.DataFrame([new_row])],ignore_index=True)
display(df)
df.dtypes
Note the []
around new_row
, otherwise you get a stupid error. (I really don't understand the deprecation of append
, BTW. It allows for much more readable code)
But now I get this: This isn't the same as above, all the dtypes
are object
!
The only way I found to recover my dtypes is to use infer_objects
:
# Recover dtypes by using infer_objects()
dh = df.infer_objects()
dh.dtypes
And dh
is now the same as dg
.
Note that even if I do
df = pd.concat([df,pd.DataFrame([new_row]).infer_objects()],ignore_index=True)
above, it still does not work. I believe this is due to a bug in concat
: when an empty DataFrame is concat'ed to a non-empty DataFrame, the resulting DataFrame fails to take over the dtypes of second DataFrame. We can verify this with:
pd.concat([pd.DataFrame(),df],ignore_index=True).dtypes
and all dtypes
are still object
.
Is there a better way to build a DataFrame row-by-row and have the correct dtypes inferred automatically?
CodePudding user response:
Your initial dataframe df = pd.DataFrame(columns=cols)
columns are all of type object
because it has no data to infer dtypes from. So you need to set them with astype
. Like @mozway commented it is recommended to use a list to add the from the loop.
I expect this to work for you:
import pandas as pd
cols = ['P/N','Date','x','y','z']
dtypes = ['object', 'datetime64[ns]', 'int64', 'float64', 'float64']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]
df = pd.DataFrame(columns=cols).astype({c:d for c,d in zip(cols,dtypes)})
new_rows = []
for i in range(3):
new_row = [PN[i], pd.to_datetime(dates[i]), xd[i], yd[i], zd[i]]
new_rows.append(new_row)
df_new = pd.concat([df, pd.DataFrame(new_rows, columns=cols)], axis=0)
print(df_new)
print(df_new.info())
Output:
P/N Date x y z
0 10a1 2022-07-01 0 1.1 -0.8
1 10a2 2022-07-03 1 1.2 0.0
2 10a3 2022-07-05 2 1.3 0.8
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 P/N 3 non-null object
1 Date 3 non-null datetime64[ns]
2 x 3 non-null int64
3 y 3 non-null float64
4 z 3 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 144.0 bytes