Home > front end >  Filter rows without any NaNs and date is consecutive in Pandas
Filter rows without any NaNs and date is consecutive in Pandas

Time:10-28

Let's I have a df in which date column is consecutive:

          date     a   b     c
0    2021-10-1   8.0  10  12.0
1    2021-10-2   NaN  16   NaN
2    2021-10-3   1.0  14   NaN
3    2021-10-4   2.0  15   NaN
4    2021-10-5   3.0  16   5.0
5    2021-10-6   4.0  17   6.0
6    2021-10-7   9.0  22  11.0
7    2021-10-8  10.0  23  12.0
8    2021-10-9  11.0  24  13.0
9   2021-10-10  12.0  25  14.0
10  2021-10-11  13.0  26   NaN
11  2021-10-12   NaN  27   NaN
12  2021-10-13   NaN  28   NaN

I would like to filter rows in which date is consecutive and at same time no NaN for columns a, b and c.

With df.dropna(how='any'), I get:

         date     a   b     c
0   2021-10-1   8.0  10  12.0
4   2021-10-5   3.0  16   5.0
5   2021-10-6   4.0  17   6.0
6   2021-10-7   9.0  22  11.0
7   2021-10-8  10.0  23  12.0
8   2021-10-9  11.0  24  13.0
9  2021-10-10  12.0  25  14.0

But I hope to drop the first row as well even it doesn't have any NaN, since its date is not consecutive with others:

         date     a   b     c
4   2021-10-5   3.0  16   5.0
5   2021-10-6   4.0  17   6.0
6   2021-10-7   9.0  22  11.0
7   2021-10-8  10.0  23  12.0
8   2021-10-9  11.0  24  13.0
9  2021-10-10  12.0  25  14.0

How could I achieve that in Pandas? Thanks.

CodePudding user response:

solution

df[df["date"].diff().eq(pd.Timedelta("1 day"))].dropna()

explanation

This calculates the difference between consecutive elements in the `date1 column

df["date"].diff()

This compares the differences to 1 day, and produces a series of True/False values

df["date"].diff().eq(pd.Timedelta("1 day"))

We can use that to filter to remove any row which does not have the following day in the data.

The last step is to call .dropna which removes any row which contains a NaN value

CodePudding user response:

You can test consecutive values by Series.diff, compare for 1 and for all values if not missing:

df["date"] = pd.to_datetime(df["date"])

df = df[df["date"].diff().dt.days.eq(1) & df.notna().all(axis=1)]

print (df)
        date     a   b     c
4 2021-10-05   3.0  16   5.0
5 2021-10-06   4.0  17   6.0
6 2021-10-07   9.0  22  11.0
7 2021-10-08  10.0  23  12.0
8 2021-10-09  11.0  24  13.0
9 2021-10-10  12.0  25  14.0

If there is DatetimeIndex:

df["date"] = pd.to_datetime(df["date"])

df = df.set_index('date')

df = df[df.index.to_series().diff().dt.days.eq(1) & df.notna().all(axis=1)]
print (df)
               a   b     c
date                      
2021-10-05   3.0  16   5.0
2021-10-06   4.0  17   6.0
2021-10-07   9.0  22  11.0
2021-10-08  10.0  23  12.0
2021-10-09  11.0  24  13.0
2021-10-10  12.0  25  14.0
  • Related