Home > Net >  Pandas fillna for chronologically ordered dates
Pandas fillna for chronologically ordered dates

Time:03-15

I have a pandas dataframe of some time series data which contains some missing dates. I want to fill these based on the dates in the adjacent rows.

For example, the entries in the dataframe are chronologically ordered thus I know that the two missing dates in the below dataframe must also be 15/3/22 because the above and below entires are both 15/3/22.

Date Time etc.
15/3/22 1:34:40 AM ...
NaN 1:36:18 AM ...
NaN 1:36:20 AM ...
15/3/22 3:08:23 AM ...

Alternatively, when the entries are days apart, you won't be able to tell which date the missing values should be (3/3, 4/3, or 5/3) and thus they should be left as NaN.

Date Time etc.
03/3/22 8:50:15 AM ...
NaN 10:40:00 AM ...
NaN 12:10:40 AM ...
05/3/22 3:08:23 AM ...

How can I do this in Pandas? Thanks

Sample code to create the example dataframes:

data1 = [['15/3/22', '1:34:40'], ['', '1:36:18'], ['', '1:36:20'],['15/3/22', '3:08:23']]
data2 = [['03/3/22', '8:50:15'], ['', '10:40:00'], ['', '12:10:40'], ['05/3/22', '3:08:23']]
df1 = pd.DataFrame(data1, columns = ['Date', 'Time'])
df2 = pd.DataFrame(data2, columns = ['Date', 'Time'])

df1['Date'] = pd.to_datetime( df1['Date'], format='%d/%m/%y', errors="coerce" )
df1['Time'] = pd.to_datetime( df1['Time'], format='%H:%M:%S', errors="coerce" )
df2['Date'] = pd.to_datetime( df2['Date'], format='%d/%m/%y', errors="coerce" )
df2['Time'] = pd.to_datetime( df2['Time'], format='%H:%M:%S', errors="coerce" )

CodePudding user response:

There's a lot of good examples of various strategies for filling missing data in the documentation here: https://pandas.pydata.org/docs/user_guide/missing_data.html#filling-missing-values-fillna

I don't think a particular method exists that follows the filling strategy you've described (though it is a very reasonable one), but I think we can use a trick to achieve it quite easily: by doing a forward-fill and a backward-fill and then only keeping the changes when these two give the same result, which must be when it's the same date before and after the run of NaNs.

date_ff = df["Date"].fillna(method="ffill")
date_bf = df["Date"].fillna(method="bfill")

df.loc[date_ff == date_bf, "Date"] = date_ff

In case you're unfamiliar with the .loc syntax, that last line is filtering to just the rows where date_ff and date_bf are equal and setting the Date column to the corresponding value from date_ff (we could have used date_bf - the point is that they give exactly the same result).

Note that since the non-NaN values aren't touched by fillna() they will be equal in date_ff and date_bf so are effectively left unchanged.

  • Related