Pandas keeps casting NaN values to NaT values when adding new rows to data frame

Time:05-30

I have a script that reads a CSV into a dataframe and then allows the user to extend the dataframe by adding extra data to it. It takes the last value in the date column and prompts the user for a value day by day.

If the user doesn't enter anything, the value is set to math.nan. But when I append the row to the dataframe, the supposed NaN gets cast to a NaT.

I've recreated a reproducible example below.

How do I ensure that my NaNs aren't cast to NaTs?

#!/usr/bin/env python

import pandas as pd
import datetime as dt
import math

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
    })

last_recorded_date = df['date'].iloc[-1]

while True:
    next_date = last_recorded_date + dt.timedelta(days=1)
    weight = input(f"{next_date}: ")
    if weight == 'q':
        break
    elif weight == '':
        weight = math.nan
    else:
        weight = float(weight)

    df.loc[len(df.index)] = [next_date, weight]
    last_recorded_date = next_date

print(df)
#         date weight
# 0 2022-05-01  250.0
# 1 2022-05-02  249.0
# 2 2022-05-03  247.0
# 3 2022-05-04  243.0
# 4 2022-05-05  240.0
# 5 2022-05-06    NaT

CodePudding user response:

That is weird. But some experimentation reveals some clues:

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
    })

# Try this
df.loc[4] = None

This emits:

FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

That doesn't exactly explain why it added NaT to the second column, but it does suggest that the types need to be specified when appending to the existing dataframe.
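As far as I can tell, the NaT comes from dtype inference: a row assigned as a plain list appears to be packed into a single Series before it is appended, and a Series holding a Timestamp and a NaN is inferred as datetime64. The packing step is my assumption about pandas internals; the inference itself is easy to check with a quick sketch:

```python
import math
import pandas as pd

# Assumption: assigning a list to df.loc[new_label] builds one Series from it,
# so pandas must choose a single dtype for the Timestamp and the NaN together.
row = pd.Series([pd.Timestamp("2022-05-06"), math.nan])

print(row.dtype)    # a datetime64 dtype, not float or object
print(row.iloc[1])  # NaT
```

Once the value is NaT, it no longer fits the float64 weight column, which would explain why NaT shows up there.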

One solution is to append a row whose dtypes are explicit:

import numpy as np

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
    })

df = df.append(pd.DataFrame([{'date': pd.NaT, 'weight': np.nan}]), ignore_index=True)
assert (df.dtypes.values == ('<M8[ns]', 'float64')).all()

However, this emits the following warning (and pandas 2.0 removed append altogether):

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df = df.append(pd.DataFrame([{'date': pd.NaT, 'weight': np.nan}]), ignore_index=True)

So I guess the right solution is now:

new_row = pd.DataFrame([{'date': pd.NaT, 'weight': np.nan}])
df = pd.concat([df, new_row], ignore_index=True)
assert (df.dtypes.values == ('<M8[ns]', 'float64')).all()

But I must ask, why are you appending to a dataframe in this way? It is quite inefficient and should be avoided if possible.
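If the rows really do have to come in one at a time, one option is to collect them in a plain list and concatenate once at the end. Here is a minimal sketch of that idea; extend_weights is a hypothetical helper, and it takes the answers as a list of strings in place of the input() calls so that it can run non-interactively:

```python
import math
import pandas as pd

def extend_weights(df, entries):
    """Return df plus one new row per entry, one day apart (hypothetical helper)."""
    next_date = df["date"].iloc[-1]
    rows = []
    for entry in entries:
        next_date = next_date + pd.Timedelta(days=1)
        # An empty answer means "no measurement", stored as a float NaN.
        weight = math.nan if entry == "" else float(entry)
        rows.append({"date": next_date, "weight": weight})
    if not rows:
        return df.copy()
    # One dict per row lets pandas infer a dtype per column rather than per row.
    return pd.concat([df, pd.DataFrame(rows)], ignore_index=True)

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-03"]),
    "weight": [250., 249, 247],
})
df = extend_weights(df, ["243", "240", ""])
print(df.dtypes)
```

Because the new rows are built as their own DataFrame, the weight column stays float64 and the missing value stays NaN, and the frame is only copied once.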

CodePudding user response:

import pandas as pd
import datetime as dt
import math

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
    })

last_recorded_date = df['date'].iloc[-1]

while True:
    next_date = last_recorded_date + dt.timedelta(days=1)
    weight = input(f"{next_date}: ")
    if weight == 'q':
        break
    elif weight == '':
        weight = math.nan
    else:
        weight = float(weight)

    df.loc[len(df.index)] = [next_date, weight]
    last_recorded_date = next_date

# Assign back to the column; assigning to df would replace the whole frame with a Series
df['weight'] = df['weight'].replace(pd.NaT, math.nan)

print(df)