Why do I get NaT values rather than NaN when adding partial rows to this Dataframe?


I have a script that reads a .csv file into a dataframe and then allows the user to extend the dataframe by adding extra data to it. It takes the last value in the date column and prompts the user for a value day by day.

If the user doesn't enter anything at the prompt, the value is set to math.nan. But when I append the row to the dataframe, that NaN is cast to NaT instead.

I've recreated a reproducible example below.

How do I ensure that my NaNs aren't cast to NaTs?

#!/usr/bin/env python

import pandas as pd
import datetime as dt
import math

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
})

last_recorded_date = df['date'].iloc[-1]
next_date = last_recorded_date + dt.timedelta(days=1)
df.loc[len(df.index)] = [next_date, math.nan]

print(df)
#         date weight
# 0 2022-05-01  250.0
# 1 2022-05-02  249.0
# 2 2022-05-03  247.0
# 3 2022-05-04    NaT

CodePudding user response:

When setting a row from a list, the list is first converted to a Series. All elements of a Series must share one dtype; the first value is a datetime, so every value is converted to a datetime in the resulting Series. In particular, math.nan becomes NaT. Pandas does not use the existing column types to inform this conversion; instead, the column types are adjusted as needed - the weight column's dtype widens from float64 to object to accommodate the NaT.
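The inference step can be seen in isolation. This small sketch builds a Series from the same mixed list pandas would construct internally; because the first element is a Timestamp, pandas picks a datetime64 dtype and the math.nan is coerced to NaT:

```python
import math
import pandas as pd

# A mixed list becomes one Series with a single inferred dtype.
# The leading Timestamp makes pandas choose datetime64[ns],
# so math.nan is coerced to the datetime missing value, NaT.
s = pd.Series([pd.Timestamp('2022-05-04'), math.nan])
print(s.dtype)    # datetime64[ns]
print(s.iloc[1])  # NaT
```

This is the same coercion that then lands in the weight column when the whole row is written back.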

From my testing, using a tuple instead seems to fix the problem:

df.loc[len(df.index)] = (next_date, math.nan)

CodePudding user response:

That is weird. But some experimentation reveals some clues:

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
    })

# Try this
df.loc[4] = None

This emits:

FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

That doesn't exactly explain why it added NaT to the second column but it does indicate that the types need to be specified when appending to the existing dataframe.

One solution, as explained here, is as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
    })

next_date = pd.Timestamp('2022-05-04')
df = df.append(pd.DataFrame([{'date': next_date, 'weight': np.nan}]), ignore_index=True)
assert (df.dtypes.values == ('<M8[ns]', 'float64')).all()

However, this emits:

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df = df.append(pd.DataFrame([{'date': next_date, 'weight': np.nan}]), ignore_index=True)

So I guess the right solution is now:

new_row = pd.DataFrame([{'date': next_date, 'weight': np.nan}])
df = pd.concat([df, new_row]).reset_index(drop=True)
assert (df.dtypes.values == ('<M8[ns]', 'float64')).all()

But I must ask: why are you appending to a dataframe row by row in this way? Each append copies the whole frame, which is quite inefficient and should be avoided if possible.
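One common alternative, sketched below with made-up sample rows, is to collect the incoming values in a plain Python list and build the DataFrame once at the end. Each column is then inferred independently, so the NaN stays a float NaN:

```python
import math
import pandas as pd

# Accumulate rows as dicts instead of appending to the DataFrame each time.
rows = [
    {'date': pd.Timestamp('2022-05-04'), 'weight': math.nan},
    {'date': pd.Timestamp('2022-05-05'), 'weight': 245.0},
]

# Built once from the list of dicts, 'date' is datetime64[ns] and
# 'weight' stays float64 - no NaT leaks into the weight column.
new = pd.DataFrame(rows)
print(new.dtypes)
```

If you need to extend an existing frame, concatenate this once at the end with pd.concat rather than appending inside the loop.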

CodePudding user response:

import pandas as pd
import datetime as dt
import math

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
    })

last_recorded_date = df['date'].iloc[-1]

while True:
    next_date = last_recorded_date + dt.timedelta(days=1)
    weight = input(f"{next_date}: ")
    if weight == 'q':
        break
    elif weight == '':
        weight = math.nan
    else:
        weight = float(weight)

    df.loc[len(df.index)] = [next_date, weight]
    last_recorded_date = next_date

# Convert any NaT values that leaked into the weight column back to NaN.
df['weight'] = df['weight'].replace(pd.NaT, math.nan)

print(df)