Let's I have a df
in which date column is consecutive:
date a b c
0 2021-10-1 8.0 10 12.0
1 2021-10-2 NaN 16 NaN
2 2021-10-3 1.0 14 NaN
3 2021-10-4 2.0 15 NaN
4 2021-10-5 3.0 16 5.0
5 2021-10-6 4.0 17 6.0
6 2021-10-7 9.0 22 11.0
7 2021-10-8 10.0 23 12.0
8 2021-10-9 11.0 24 13.0
9 2021-10-10 12.0 25 14.0
10 2021-10-11 13.0 26 NaN
11 2021-10-12 NaN 27 NaN
12 2021-10-13 NaN 28 NaN
I would like to filter rows in which date is consecutive and at same time no NaN
for columns a
, b
and c
.
With df.dropna(how='any')
, I get:
date a b c
0 2021-10-1 8.0 10 12.0
4 2021-10-5 3.0 16 5.0
5 2021-10-6 4.0 17 6.0
6 2021-10-7 9.0 22 11.0
7 2021-10-8 10.0 23 12.0
8 2021-10-9 11.0 24 13.0
9 2021-10-10 12.0 25 14.0
But I hope to drop the first row as well even it doesn't have any NaN
, since its date is not consecutive with others:
date a b c
4 2021-10-5 3.0 16 5.0
5 2021-10-6 4.0 17 6.0
6 2021-10-7 9.0 22 11.0
7 2021-10-8 10.0 23 12.0
8 2021-10-9 11.0 24 13.0
9 2021-10-10 12.0 25 14.0
How could I achieve that in Pandas? Thanks.
CodePudding user response:
solution
df[df["date"].diff().eq(pd.Timedelta("1 day"))].dropna()
explanation
This calculates the difference between consecutive elements in the `date1 column
df["date"].diff()
This compares the differences to 1 day, and produces a series of True/False values
df["date"].diff().eq(pd.Timedelta("1 day"))
We can use that to filter to remove any row which does not have the following day in the data.
The last step is to call .dropna
which removes any row which contains a NaN
value
CodePudding user response:
You can test consecutive values by Series.diff
, compare for 1
and for all values if not missing:
df["date"] = pd.to_datetime(df["date"])
df = df[df["date"].diff().dt.days.eq(1) & df.notna().all(axis=1)]
print (df)
date a b c
4 2021-10-05 3.0 16 5.0
5 2021-10-06 4.0 17 6.0
6 2021-10-07 9.0 22 11.0
7 2021-10-08 10.0 23 12.0
8 2021-10-09 11.0 24 13.0
9 2021-10-10 12.0 25 14.0
If there is DatetimeIndex
:
df["date"] = pd.to_datetime(df["date"])
df = df.set_index('date')
df = df[df.index.to_series().diff().dt.days.eq(1) & df.notna().all(axis=1)]
print (df)
a b c
date
2021-10-05 3.0 16 5.0
2021-10-06 4.0 17 6.0
2021-10-07 9.0 22 11.0
2021-10-08 10.0 23 12.0
2021-10-09 11.0 24 13.0
2021-10-10 12.0 25 14.0