I have a pandas dataframe of some time series data which contains some missing dates. I want to fill these based on the dates in the adjacent rows.
For example, the entries in the dataframe are chronologically ordered thus I know that the two missing dates in the below dataframe must also be 15/3/22 because the above and below entires are both 15/3/22.
Date | Time | etc. |
---|---|---|
15/3/22 | 1:34:40 AM | ... |
NaN | 1:36:18 AM | ... |
NaN | 1:36:20 AM | ... |
15/3/22 | 3:08:23 AM | ... |
Alternatively, when the entries are days apart, you won't be able to tell which date the missing values should be (3/3, 4/3, or 5/3) and thus they should be left as NaN.
Date | Time | etc. |
---|---|---|
03/3/22 | 8:50:15 AM | ... |
NaN | 10:40:00 AM | ... |
NaN | 12:10:40 AM | ... |
05/3/22 | 3:08:23 AM | ... |
How can I do this in Pandas? Thanks
Sample code to create the example dataframes:
data1 = [['15/3/22', '1:34:40'], ['', '1:36:18'], ['', '1:36:20'],['15/3/22', '3:08:23']]
data2 = [['03/3/22', '8:50:15'], ['', '10:40:00'], ['', '12:10:40'], ['05/3/22', '3:08:23']]
df1 = pd.DataFrame(data1, columns = ['Date', 'Time'])
df2 = pd.DataFrame(data2, columns = ['Date', 'Time'])
df1['Date'] = pd.to_datetime( df1['Date'], format='%d/%m/%y', errors="coerce" )
df1['Time'] = pd.to_datetime( df1['Time'], format='%H:%M:%S', errors="coerce" )
df2['Date'] = pd.to_datetime( df2['Date'], format='%d/%m/%y', errors="coerce" )
df2['Time'] = pd.to_datetime( df2['Time'], format='%H:%M:%S', errors="coerce" )
CodePudding user response:
There's a lot of good examples of various strategies for filling missing data in the documentation here: https://pandas.pydata.org/docs/user_guide/missing_data.html#filling-missing-values-fillna
I don't think a particular method exists that follows the filling strategy you've described (though it is a very reasonable one), but I think we can use a trick to achieve it quite easily: by doing a forward-fill and a backward-fill and then only keeping the changes when these two give the same result, which must be when it's the same date before and after the run of NaN
s.
date_ff = df["Date"].fillna(method="ffill")
date_bf = df["Date"].fillna(method="bfill")
df.loc[date_ff == date_bf, "Date"] = date_ff
In case you're unfamiliar with the .loc
syntax, that last line is filtering to just the rows where date_ff
and date_bf
are equal and setting the Date column to the corresponding value from date_ff
(we could have used date_bf
- the point is that they give exactly the same result).
Note that since the non-NaN
values aren't touched by fillna()
they will be equal in date_ff
and date_bf
so are effectively left unchanged.