I have a script that reads a .csv file into a dataframe and then allows the user to extend the dataframe by adding extra data to it. It takes the last value in the date column and starts prompting the user for a value day-by-day. If the user doesn't specify anything for input, the value gets cast to math.nan. Except when I go to append the row to the dataframe, the supposed NaN gets cast to a NaT.
I've recreated a reproducible example below. How do I ensure that my NaNs aren't cast to NaTs?
#!/usr/bin/env python
import pandas as pd
import datetime as dt
import math
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
})
last_recorded_date = df['date'].iloc[-1]
next_date = last_recorded_date + dt.timedelta(days=1)
df.loc[len(df.index)] = [next_date, math.nan]
print(df)
# date weight
# 0 2022-05-01 250.0
# 1 2022-05-02 249.0
# 2 2022-05-03 247.0
# 3 2022-05-04     NaT
CodePudding user response:
When setting a row from a list, the list is first converted to a Series. Elements of a Series have to be all the same type; the first value is a datetime, so every value is converted to a datetime in the resulting Series. In particular, math.nan becomes NaT. Pandas does not use the existing column types to inform the process; instead, the column types are adjusted as needed - the weight column's type expands from float to object.
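A minimal sketch of that inference, using pd.Series directly (the same coercion the row assignment goes through):

```python
import math
import pandas as pd

# A list whose first element is a Timestamp is inferred as a
# datetime64[ns] Series, so math.nan is coerced to NaT.
s = pd.Series([pd.Timestamp('2022-05-04'), math.nan])
print(s.dtype)    # datetime64[ns]
print(s.iloc[1])  # NaT
```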
From my testing, using a tuple instead seems to fix the problem:
df.loc[len(df.index)] = (next_date, math.nan)
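A quick way to check whether the dtypes actually survived the assignment on your install (this behavior may differ between pandas versions, so it is worth verifying rather than assuming):

```python
import math
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01']),
    'weight': [250.0],
})
df.loc[len(df.index)] = (pd.Timestamp('2022-05-02'), math.nan)

# Inspect the result: 'weight' should still be float64 and the
# missing value should be NaN rather than NaT.
print(df.dtypes)
print(df['weight'].iloc[-1])
```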
CodePudding user response:
That is weird. But some experimentation reveals some clues:
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
})
# Try this
df.loc[4] = None
Raises:
FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
That doesn't exactly explain why it added NaT to the second column, but it does indicate that the types need to be specified when appending to the existing dataframe.
One solution, as explained here, is as follows:
import numpy as np

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
})
next_date = pd.Timestamp('2022-05-04')
df = df.append(pd.DataFrame([{'date': next_date, 'weight': np.nan}]), ignore_index=True)
assert (df.dtypes.values == ('<M8[ns]', 'float64')).all()
However, this raises:
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df = df.append(pd.DataFrame([{'date': next_date, 'weight': np.nan}]), ignore_index=True)
So I guess the right solution is now:
new_row = pd.DataFrame([{'date': next_date, 'weight': np.nan}])
df = pd.concat([df, new_row]).reset_index(drop=True)
assert (df.dtypes.values == ('<M8[ns]', 'float64')).all()
But I must ask, why are you appending to a dataframe in this way? It is quite inefficient and should be avoided if possible.
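As an illustration of the more efficient pattern (a sketch: accumulate the new rows in a plain Python list and build a frame once at the end, instead of growing the dataframe row by row):

```python
import math
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250.0, 249.0, 247.0],
})

# Appending to a Python list is cheap; every row-wise append to a
# DataFrame copies the whole frame.
new_rows = [
    {'date': pd.Timestamp('2022-05-04'), 'weight': math.nan},
    {'date': pd.Timestamp('2022-05-05'), 'weight': 245.0},
]

# Build a DataFrame from the collected rows and concat exactly once.
# Column dtypes are inferred per column, so NaN stays NaN.
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(df.dtypes)
```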
CodePudding user response:
import pandas as pd
import datetime as dt
import math
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-05-01', '2022-05-02', '2022-05-03']),
    'weight': [250., 249, 247],
})
last_recorded_date = df['date'].iloc[-1]
while True:
    next_date = last_recorded_date + dt.timedelta(days=1)
    weight = input(f"{next_date}: ")
    if weight == 'q':
        break
    elif weight == '':
        weight = math.nan
    else:
        weight = float(weight)
    df.loc[len(df.index)] = [next_date, weight]
    last_recorded_date = next_date
# Replace any NaT that crept into the weight column, assigning back
# to the column rather than overwriting the whole dataframe.
df['weight'] = df['weight'].replace(pd.NaT, math.nan)
print(df)