How to remove the all values of a specific person from dataframe which is not continuous based on da-CodePudding

date      consumption  customer_id
2018-01-01     12             111
2018-01-02     12             111
*2018-01-03*   14             111   
*2018-01-05*   12             111
2018-01-06     45             111
2018-01-07     34             111 
2018-01-01     23             112 
2018-01-02     23             112
2018-01-03     45             112
2018-01-04     34             112
2018-01-05     23             112
2018-01-06     34             112
2018-01-01     23             113
2018-01-02     34             113
2018-01-03     45             113
2018-01-04     34             113

The values in customer 111 is not continuous, it has missing value in 2018-01-04, so i want to remove all 111 from my dataframe in pandas.

date      consumption  customer_id
2018-01-01     23             112 
2018-01-02     23             112
2018-01-03     45             112
2018-01-04     34             112
2018-01-05     23             112
2018-01-06     34             112
2018-01-01     23             113
2018-01-02     34             113
2018-01-03     45             113
2018-01-04     34             113

i want result like this ? how does it possible in pandas?

CodePudding user response：

You can compute the successive delta and check if any is greater than 1d:

drop = (pd.to_datetime(df['date'])
          .groupby(df['customer_id'])
          .apply(lambda s: s.diff().gt('1d').any())
       )

out = df[df['customer_id'].isin(drop[~drop].index)]

Or with groupby.filter:

df['date'] = pd.to_datetime(df['date'])

out = (df.groupby(df['customer_id'])
         .filter(lambda d: ~d['date'].diff().gt('1d').any())
       )

Output:

          date  consumption  customer_id
6   2018-01-01           23          112
7   2018-01-02           23          112
8   2018-01-03           45          112
9   2018-01-04           34          112
10  2018-01-05           23          112
11  2018-01-06           34          112
12  2018-01-01           23          113
13  2018-01-02           34          113
14  2018-01-03           45          113
15  2018-01-04           34          113

If you the dates are not necessarily increasing, also check you cannot go back in time:

df['date'] = pd.to_datetime(df['date'])

out = (df.groupby(df['customer_id'])
         .filter(lambda d: d['date'].diff().iloc[1:].eq('1d').all())
       )