I have been learning about the Apache Beam architecture/model. Where I work, one of the requirements is to create a more straightforward way of creating pipelines, using Apache Beam (Java) as one part of the stack.
The goal is to be able to pass dynamic values into a pipeline so that it changes while it is running, for example:
PCollection<String> output = input.apply("filter", ParDo.do(DoFn{
// get filters from external API ["filter1","filter2"]
// return data filtered data (regex(filter1, filter2))
}))
pipeline.run();
From what I have read so far, this is not possible: once the code has been compiled and is running (within a runner), the DAG has already been created and cannot be changed dynamically, as changing it would require the pipeline to be recompiled, reconstructed, and re-executed.
Am I right or is this actually possible?
Perhaps the closest option would be to use the Beam SQL shell?
CodePudding user response:
It is possible to stage a pipeline as a template and use ValueProviders to feed parameters dynamically. Take a look at Creating classic templates, where this is covered in detail. Please note that the transforms/IOs have to support ValueProviders in the classic templates model.
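As a rough sketch of what that looks like (the FilterOptions interface and the filterPattern option name are made up for this example, and options is assumed to be the parsed FilterOptions instance):

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public interface FilterOptions extends PipelineOptions {
    @Description("Regex used to filter input elements")
    ValueProvider<String> getFilterPattern();
    void setFilterPattern(ValueProvider<String> value);
}

// In the pipeline: the provider is serialized with the template and only
// resolved with .get() at worker execution time, not at graph construction.
ValueProvider<String> pattern = options.getFilterPattern();
PCollection<String> filtered = input.apply("filter",
    ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().matches(pattern.get())) {  // resolved at runtime
                c.output(c.element());
            }
        }
    }));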
On the other hand, if the desired IOs or transforms do not support ValueProviders, or you require a more dynamic pipeline, it is possible to use Flex Templates. While Flex Templates are indeed more flexible, there are some caveats, such as a more complex model and higher startup times.
CodePudding user response:
There are two parts to this question, and they are in fact very different:
- "a more straightforward way of creating pipelines"
- "change it (the pipeline) once it is running"
Templates, mentioned above, will absolutely streamline the creation of new, similar pipelines, which addresses the first point. Just for completeness, you can also use plain pipeline options to achieve something similar, as sketched below.
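A minimal sketch of the plain-options approach (MyOptions and filterPattern are again made-up names). The value is bound once at graph construction time, so it parameterizes each new launch rather than a running job:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public interface MyOptions extends PipelineOptions {
    @Description("Regex used to filter input elements")
    @Default.String(".*")
    String getFilterPattern();
    void setFilterPattern(String value);
}

// Parsed from the command line, e.g. --filterPattern="filter1|filter2"
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline pipeline = Pipeline.create(options);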
Regarding 2, as pointed out, you cannot change the execution graph (DAG) structurally while it is running. The question, though, is whether you really need that. If it is just about changing the configuration of parts of the DAG, such as the filters in the example from the question, that is possible. Consider such configuration to be slowly changing data flowing through the pipeline. The slowly updating side inputs Beam pattern implements exactly this, as sketched below.
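Here is a minimal sketch of that pattern, assuming a hypothetical fetchFiltersFromApi() helper that calls the external API: a periodic tick refreshes the filter list, which becomes a side input, and the main ParDo reads the latest list for every element.

import java.util.List;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

// Re-fetch the filters every 5 minutes and expose them as a side input.
PCollectionView<List<String>> filtersView = pipeline
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
    .apply("fetch filters", ParDo.of(new DoFn<Long, List<String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(fetchFiltersFromApi()); // e.g. ["filter1", "filter2"]
        }
    }))
    .apply(Window.<List<String>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
        .discardingFiredPanes())
    .apply(Latest.globally())
    .apply(View.asSingleton());

// The main transform reads the freshest filter list for each element.
PCollection<String> output = input.apply("filter",
    ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            for (String filter : c.sideInput(filtersView)) {
                if (c.element().matches(filter)) {
                    c.output(c.element());
                    return;
                }
            }
        }
    }).withSideInputs(filtersView));

Note that the side input is refreshed on a fixed cadence (every 5 minutes here), so configuration changes propagate with that latency rather than instantly.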
Finally, the Beam SQL shell isn't any different: every query has to be parsed and translated into the respective DAG, which is then executed.