Find missing numbers in a column dataframe pandas

I have a dataframe of stores and their invoice numbers, and I need to find the missing consecutive invoice numbers per Store. For example:

df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']

   Store Invoice
0      A       1
1      A       2
2      A       5
3      A       6
4      A       8
5      B      20
6      B      23
7      B      24
8      B      30
9      C     200
10     C     202
11     C     203
12     D     204
13     D     206

And I want a dataframe like this:

   Store MissInvoice
0      A           3
1      A           4
2      A           7
3      B          21
4      B          22
5      B          25
6      B          26
7      B          27
8      B          28
9      B          29
10     C         201
11     D         205

Thanks in advance!

CodePudding user response：

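You can use groupby.apply to compute, for each store, the set difference between the full range from the min to the max invoice number and the invoice numbers actually present. Then explode the result into one row per missing number: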

(df1.astype({'Invoice': int})
    .groupby('Store')['Invoice']
    .apply(lambda s: set(range(s.min(), s.max())).difference(s))
    .explode().reset_index()
)

Output:

   Store Invoice
0      A       3
1      A       4
2      A       7
3      B      21
4      B      22
5      B      25
6      B      26
7      B      27
8      B      28
9      B      29
10     C     201
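
As a minimal sketch of the same set-difference idea, the snippet below wraps it in a helper that sorts each store's gaps (set iteration order is not guaranteed) and names the result column MissInvoice as in the question. The function name missing_invoices and its parameters are hypothetical, and the sample data assumes the four-store df1 from the question, including Store D.

```python
import pandas as pd

def missing_invoices(df, group_col="Store", value_col="Invoice"):
    """Hypothetical helper: one row per number missing between each group's min and max."""
    values = df[value_col].astype(int)
    gaps = (
        values.groupby(df[group_col])
        # Full range from min to max, minus the numbers actually present, in sorted order.
        .apply(lambda s: sorted(set(range(s.min(), s.max())).difference(s)))
        .explode()
        .dropna()      # a group with no gaps explodes to NaN; drop it
        .astype(int)
    )
    return gaps.rename("MissInvoice").reset_index()

df1 = pd.DataFrame({
    "Store": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "D", "D"],
    "Invoice": ["1", "2", "5", "6", "8", "20", "23", "24", "30",
                "200", "202", "203", "204", "206"],
})

result = missing_invoices(df1)
print(result)
```

With this data the result should contain twelve rows, ending with 205 for Store D.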

CodePudding user response：

Here's an approach:

import pandas as pd
import numpy as np

df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)

df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store,row in df2.iterrows():
    # Numbers in the inclusive range min..max that do not appear among the store's invoices
    df2.at[store,'MissInvoice'] = np.setdiff1d(np.arange(row['min'], row['max'] + 1),
                                  df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns = ['min','max']).reset_index()

The resulting dataframe df2:

   Store MissInvoice
0      A           3
1      A           4
2      A           7
3      B          21
4      B          22
5      B          25
6      B          26
7      B          27
8      B          28
9      B          29
10     C         201

Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.
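
To see what the loop computes for a single store, here is a small sketch using Store A's invoice numbers from the question. np.setdiff1d returns the sorted unique values that are in the first array but not in the second; the variable names are only illustrative.

```python
import numpy as np

# Store A's invoice numbers from the question
invoices_a = np.array([1, 2, 5, 6, 8])

# Every number from the smallest to the largest invoice, inclusive
full_range = np.arange(invoices_a.min(), invoices_a.max() + 1)

# Values in full_range that are not in invoices_a
missing_a = np.setdiff1d(full_range, invoices_a)
print(missing_a)  # [3 4 7]
```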