TensorFlow Data Validation (TFDV) provides a way to find anomalies in your data.
However, I am only able to find a way to produce a summarized version of the anomalies (by using tfdv.validate_statistics and tfdv.display_anomalies).
Is there a function or parameter that, instead of reporting the summary, returns the rows containing an anomaly along with the anomaly type?
Take the example below:
import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto import schema_pb2
# df is an existing pandas DataFrame with columns c1 and c2
df_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(statistics=df_stats)
tfdv.set_domain(schema, "c1", schema_pb2.IntDomain(min=1, max=3))
anomalies = tfdv.validate_statistics(statistics=df_stats, schema=schema)
tfdv.display_anomalies(anomalies)
Is there a way to leverage TFDV to return something like:
index | c1 | c2 | anomaly_type |
---|---|---|---|
3 | 100 | Z | c1 Out-of-range values |
4 | 100000 | A | c1 Out-of-range values |
If not, what alternative would you recommend?
CodePudding user response:
No, you cannot. That is because the statistics are validated, not the actual data. For the c1 column, TFDV compares the min and max values found in the statistics with the min and max values defined in the schema. This implies:
- TFDV is unaware of whether other values are also out of range (e.g. 100)
- TFDV cannot return the index of the rows where the anomaly was detected, since it does not have this information
See this for more: https://www.tensorflow.org/tfx/data_validation/anomalies?hl=en
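As for an alternative: since the row-level information is gone by the time the statistics are computed, one option is to apply the same range check to the DataFrame itself with plain pandas. The following is only a minimal sketch, not a TFDV feature. The sample data is made up to mirror the table in the question, the helper name find_out_of_range is hypothetical, and the bounds are copied by hand from the IntDomain(min=1, max=3) set on c1 rather than read from the schema.

```python
import pandas as pd

# Hypothetical data: rows 3 and 4 mirror the table in the question,
# rows 0-2 are invented in-range values.
df = pd.DataFrame(
    {"c1": [1, 2, 3, 100, 100000], "c2": ["X", "Y", "X", "Z", "A"]}
)

# Bounds copied by hand from IntDomain(min=1, max=3) in the question's schema.
C1_MIN, C1_MAX = 1, 3


def find_out_of_range(frame, column, lower, upper):
    """Return the rows whose `column` value lies outside [lower, upper].

    Missing values are not flagged here; they are a different kind of anomaly.
    """
    values = frame[column]
    # Series.between is inclusive on both ends by default.
    mask = values.notna() & ~values.between(lower, upper)
    flagged = frame.loc[mask].copy()
    flagged["anomaly_type"] = f"{column} Out-of-range values"
    return flagged


anomalous_rows = find_out_of_range(df, "c1", C1_MIN, C1_MAX)
print(anomalous_rows)
```

The original index is preserved, so the result can be joined back to the source data. Other anomaly types (unexpected string values, missing columns, and so on) would each need their own row-level check written in the same way.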