Dockerized Apache Beam returns "No id provided"


I've hit a problem with dockerized Apache Beam. When trying to run the container, I get a "No id provided." message and nothing more. Here are the code and files:

Dockerfile

FROM apache/beam_python3.8_sdk:latest
RUN apt update
RUN apt install -y wget curl unzip git
COPY ./ /root/data_analysis/
WORKDIR /root/data_analysis
RUN python3 -m pip install -r data_analysis/beam/requirements.txt
ENV PYTHONPATH=/root/data_analysis
ENV WORKER_ID=1
CMD python3 data_analysis/analysis.py

Code analysis.py:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(["--runner=DirectRunner"])

    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x-1) | beam.Map(print)

if __name__ == "__main__":
    run()

Commands:

% docker build -f Dockerfile_beam -t beam .
[+] Building 242.2s (12/12) FINISHED
...

% docker run --name=beam beam   
2021/09/15 13:44:07 No id provided.

I found that this error message is most likely generated by this line: https://github.com/apache/beam/blob/410ad7699621e28433d81809f6b9c42fe7bd6a60/sdks/python/container/boot.go#L98

But what does it mean? Which id is this? What am I missing?

CodePudding user response:

This error is most likely happening because your Docker image is based on the SDK harness image (apache/beam_python3.8_sdk). SDK harness images are used in portable pipelines: when a portable runner needs to execute stages of a pipeline that must run in their original language, it starts a container with the SDK harness and delegates execution of those stages to it. When the SDK harness boots up, it therefore expects various configuration details to be provided by the runner that started it, one of which is the ID. When you start this container directly, those configuration details are not provided and it crashes.

For context on your specific use case, let me first diagram the different processes involved in running a portable pipeline.

Pipeline Construction <---> Job Service <---> SDK Harness
                                         \--> Cross-Language SDK Harness
  • Pipeline Construction - The process where you define and run your pipeline. It sends your pipeline definition to a Job Service and receives a pipeline result. It does not execute any of the pipeline.
  • Job Service - A process specific to your runner of choice. It is potentially written in a different language than your pipeline construction, and therefore cannot run user code such as custom DoFns.
  • SDK Harness - A process that executes user code, initiated and managed by the Job Service. By default, this is in a docker container.
  • Cross-Language SDK Harness - A process executing code written in a different language than your pipeline construction. In your case, Python's Kafka IO is a cross-language transform, so it is actually executed in a Java SDK harness (a minimal submission sketch follows this list).
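
To make those roles concrete, here is a minimal sketch of your pipeline construction code submitted to a job service rather than run entirely in-process. The job endpoint address (localhost:8099, a common default for the Flink job server) is an assumption for illustration only:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Pipeline construction only: this process builds the pipeline and submits
    # it to the job service; it does not execute any user code itself.
    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",  # assumed job service address
        "--environment_type=DOCKER",      # let the runner start SDK harness containers
    ])

    with beam.Pipeline(options=options) as p:
        # The lambdas below run in the SDK harness, not in this process.
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x - 1) | beam.Map(print)

if __name__ == "__main__":
    run()

With a setup like this, the apache/beam_python3.8_sdk image is started by the runner itself, which supplies the id that the boot code is complaining about.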

Currently, the Docker container you created is based on an SDK harness container, which does not sound like what you want. You seem to have been trying to containerize your pipeline construction code, but accidentally containerized the SDK harness instead. And since you described wanting the ReadFromKafka consumer to be containerized, it sounds like what you actually need is for the Job Server to be containerized, in addition to any SDK harnesses it uses.
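
For illustration, the pipeline construction side of a Kafka pipeline might look roughly like the sketch below; the broker address and topic name are placeholders, and ReadFromKafka is a cross-language transform that is expanded through a Java expansion service and executed in a Java SDK harness:

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",  # assumed job service address
    ])

    with beam.Pipeline(options=options) as p:
        (p
         # Cross-language step: executes in a Java SDK harness.
         | ReadFromKafka(
             consumer_config={"bootstrap.servers": "kafka:9092"},  # placeholder broker
             topics=["my-topic"])                                  # placeholder topic
         # Plain Python step: executes in the Python SDK harness.
         | beam.Map(print))

if __name__ == "__main__":
    run()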

Containerizing the Job Server is possible, and in some cases is already done for you; the Flink runner's job server, for example, is available as a Docker image. Containerized job servers may give you a bit of trouble with artifacts, as the container won't have access to artifact staging directories on your local machine, but there may be ways around that.

Additionally, you mentioned that you want to avoid having SDK harnesses start in a nested Docker container. If you start up a worker pool Docker container for the SDK harness and set it as an external environment, the runner (assuming it supports external environments) will connect to the URL you supply instead of creating a new Docker container. You will need to configure this for the Java cross-language environment as well, if that is possible in the Python SDK. This configuration is done via Python's pipeline options; --environment_type and --environment_options are good starting points.
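
As a rough sketch, assuming a Python SDK worker pool is already running and reachable at localhost:50000 (an assumed address), the options could look something like this:

from apache_beam.options.pipeline_options import PipelineOptions

# External environment: connect to an already-running SDK harness worker pool
# instead of letting the runner start a nested Docker container.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",         # assumed job service address
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",  # assumed worker pool address
])

Note that this only configures the Python SDK harness; as mentioned above, the Java cross-language environment would need its own configuration, if the Python SDK exposes a way to do that.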
