Imagine we have a Data source, which can be a blob storage or table.
When new data comes into the data source, the main objective is to create a mechanism so that we can first check the data quality of the new data using certain statistical tests, then if it passes these tests, we should be able to combine the new data with the previous data source. The Data source must be versioned.
Also if the new data fails the statistical tests, then we should have a mechanism where it alerts a developer, then if the developer decides to override and then we should be able to combine the new data with the previous data source.
This specific part must be triggered manually, the starting point where we check the new delta. After doing so we need to trigger an Azure DevOps Pipeline.
What tools can we use for this? Are there any reference guides we can follow for this? I need to implement this in Azure.
Key Concerns:
- Dataset: Being able to version.
- Way to detect delta and store it in a separate place before the tests.
- Way to allow a developer to have an override.
- Performing statistical tests.
CodePudding user response:
Assuming that the steps in your entire workflow can be broken down into discreet steps, are relatively idempotent or can be checkpointed at each step, and are not long running, then yes, you could explore using Durable Functions, an advanced orchestration framework for Azure Functions.
Suggestions to match your goals:
- Dataset: Being able to version - You should version this explicitly in your dataset during generation. If that's not feasible, you may derive the version based on a hash of a combination of metadata for the dataset.
- Way to detect delta and store it in a separate place before the tests - depends on what delta means to your dataset. You can make the code check the previous hash from an entry in table storage, and compare with the current hash.
- Way to allow a developer to have an override - Yes, see Human interaction in Durable Functions.
- Performing statistical tests - If there are multiple tests to run for each pass, then consider using the fan-out/fan-in pattern.