I have the below sample dataframe. Each hour has 5 instances. Is there any module or a way to find the missing data in a given column in Python. For instance for hour 2, instance 3 is missing. How can we identify this missing instance in a larger dataset dynamically in Python and print a message that an instance is missing.
Date Hour Instance
2022-10-20 1 1
2022-10-20 1 2
2022-10-20 1 3
2022-10-20 1 4
2022-10-20 1 5
2022-10-20 2 1
2022-10-20 2 2
2022-10-20 2 4
2022-10-20 2 5
Thank you.
CodePudding user response:
(df.set_index(["Hour", "Instance"])
.unstack()
.isna().where(lambda fr: fr)
.stack()
.reset_index()
[["Hour", "Instance"]])
- move Hour & Instance pairs to the index
- then unstack the last one, i.e., Instance
- this will have a reindexing effect on hours: all of them will have 1 to 5
- and the missing ones will be marked as NaN
- now mask the non-NaNs to become NaN, and NaNs to be True...
- ...because when stack'ed, NaNs will go and only original missings will present
- move Hour & Instance back to columns and select them to show the desired pairs
to get
Hour Instance
0 2 3
Meaning, there was only 1 instance missing, and it was Hour 2's Instance 3.
CodePudding user response:
Using a crosstab
:
df2 = pd.crosstab(df['Hour'], df['Instance'])
out = df2[df2.eq(0)].stack().reset_index()[['Hour', 'Instance']]
Output:
Hour Instance
0 2 3
CodePudding user response:
First, define a function that checks if there is a missing instance.
def is_data_missing(array):
"""Return True when data is missing, ie array is different from range."""
return list(array) != list(range(1, len(array) 1))
Then you can apply it to your DataFrame, grouping by hour first.
>>> df.groupby("Hour").apply(lambda x: is_range(x["Instance"].values))
Hour
1 False
2 True
The resulting DataFrame gives you the hours when data was missing. You can then iterate through the items to print a message.
>>> for hour, is_missing in df_missing.items():
... if is_missing:
... print(f"At hour {hour} data is missing")
At hour 2 data is missing