How to check if a pattern of rows is missing in a pandas dataframe in Python-CodePudding

I have the below sample dataframe. Each hour has 5 instances. Is there any module or a way to find the missing data in a given column in Python. For instance for hour 2, instance 3 is missing. How can we identify this missing instance in a larger dataset dynamically in Python and print a message that an instance is missing.

Date          Hour  Instance 
2022-10-20     1      1 
2022-10-20     1      2
2022-10-20     1      3
2022-10-20     1      4
2022-10-20     1      5
2022-10-20     2      1
2022-10-20     2      2 
2022-10-20     2      4
2022-10-20     2      5

Thank you.

CodePudding user response：

(df.set_index(["Hour", "Instance"])
   .unstack()
   .isna().where(lambda fr: fr)
   .stack()
   .reset_index()
   [["Hour", "Instance"]])

move Hour & Instance pairs to the index
then unstack the last one, i.e., Instance
- this will have a reindexing effect on hours: all of them will have 1 to 5
- and the missing ones will be marked as NaN
now mask the non-NaNs to become NaN, and NaNs to be True...
...because when stack'ed, NaNs will go and only original missings will present
move Hour & Instance back to columns and select them to show the desired pairs

to get

   Hour  Instance
0     2         3

Meaning, there was only 1 instance missing, and it was Hour 2's Instance 3.

CodePudding user response：

Using a crosstab:

df2 = pd.crosstab(df['Hour'], df['Instance'])

out = df2[df2.eq(0)].stack().reset_index()[['Hour', 'Instance']]

Output:

   Hour  Instance
0     2         3

CodePudding user response：

First, define a function that checks if there is a missing instance.

def is_data_missing(array):
    """Return True when data is missing, ie array is different from range."""
    return list(array) != list(range(1, len(array)   1))

Then you can apply it to your DataFrame, grouping by hour first.

>>> df.groupby("Hour").apply(lambda x: is_range(x["Instance"].values))
Hour
1    False 
2     True

The resulting DataFrame gives you the hours when data was missing. You can then iterate through the items to print a message.

>>> for hour, is_missing in df_missing.items():
...     if is_missing:
...         print(f"At hour {hour} data is missing")
At hour 2 data is missing