Home > OS >  How to build DataFrame row-by-row with correct dtypes?
How to build DataFrame row-by-row with correct dtypes?

Time:07-30

I have a DataFrame that I build row-by-row (by necessity). My issue is that at the end, all dtypes are object. This is not so if the DataFrame is created with all the data at once.

Let me explain what I mean.

import pandas as pd
from IPython.display import display

# Example data
cols = ['P/N','Date','x','y','z']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]
# Canonical way to build DataFrame (if you have all the data ready)
dg = pd.DataFrame({'P/N':PN,'Date':dates,'x':xd,'y':yd,'z':zd})
display(dg)
dg.dtypes

Here's what I get. Note the correct dtypes:

Results

OK, now I do the same thing row-by-row:

# Build empty DataFrame
cols = ['P/N','Date','x','y','z']
df = pd.DataFrame(columns=cols)
# Add rows in loop
for i in range(3):
    new_row = {'P/N':PN[i],'Date':pd.to_datetime(dates[i]),'x':xd[i],'y':yd[i],'z':zd[i]}
    # deprecated
    #df = df.append(new_row,ignore_index=True)
    df = pd.concat([df,pd.DataFrame([new_row])],ignore_index=True)
display(df)
df.dtypes

Note the [] around new_row, otherwise you get a stupid error. (I really don't understand the deprecation of append, BTW. It allows for much more readable code)

But now I get this: This isn't the same as above, all the dtypes are object!

Results2

The only way I found to recover my dtypes is to use infer_objects:

# Recover dtypes by using infer_objects()
dh = df.infer_objects()
dh.dtypes

And dh is now the same as dg. Note that even if I do

df = pd.concat([df,pd.DataFrame([new_row]).infer_objects()],ignore_index=True)

above, it still does not work. I believe this is due to a bug in concat: when an empty DataFrame is concat'ed to a non-empty DataFrame, the resulting DataFrame fails to take over the dtypes of second DataFrame. We can verify this with:

pd.concat([pd.DataFrame(),df],ignore_index=True).dtypes

and all dtypes are still object. Is there a better way to build a DataFrame row-by-row and have the correct dtypes inferred automatically?

CodePudding user response:

Your initial dataframe df = pd.DataFrame(columns=cols) columns are all of type object because it has no data to infer dtypes from. So you need to set them with astype. Like @mozway commented it is recommended to use a list to add the from the loop.
I expect this to work for you:

import pandas as pd
cols = ['P/N','Date','x','y','z']
dtypes = ['object', 'datetime64[ns]', 'int64', 'float64', 'float64']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]

df = pd.DataFrame(columns=cols).astype({c:d for c,d in zip(cols,dtypes)})
new_rows = []
for i in range(3):
    new_row = [PN[i], pd.to_datetime(dates[i]), xd[i], yd[i], zd[i]]
    new_rows.append(new_row)
df_new = pd.concat([df, pd.DataFrame(new_rows, columns=cols)], axis=0) 
print(df_new)
print(df_new.info())

Output:

    P/N       Date  x    y    z
0  10a1 2022-07-01  0  1.1 -0.8
1  10a2 2022-07-03  1  1.2  0.0
2  10a3 2022-07-05  2  1.3  0.8  

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   P/N     3 non-null      object        
 1   Date    3 non-null      datetime64[ns]
 2   x       3 non-null      int64         
 3   y       3 non-null      float64       
 4   z       3 non-null      float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 144.0  bytes
  • Related