I have a PySpark job running in Dataproc. Currently, we are logging to console/YARN logs. As per our requirement, we need to store the logs in a GCS bucket. Is there a way to log directly to files in a GCS bucket with the Python logging module?
I have tried to configure the logging module with the config below, but it throws an error (FileNotFoundError: [Errno 2] No such file or directory: '/gs:/bucket_name/newfile.log'):
import logging

# Fails: the gs:// URI is treated as a local file path by the logging FileHandler
logging.basicConfig(filename="gs://bucket_name/newfile.log", format='%(asctime)s %(message)s', filemode='w')
CodePudding user response:
By default, yarn:yarn.log-aggregation-enable is set to true and yarn:yarn.nodemanager.remote-app-log-dir is set to gs://<cluster-tmp-bucket>/<cluster-uuid>/yarn-logs on Dataproc 1.5+, so YARN container logs are already aggregated into that GCS directory. You can point aggregation at a different GCS directory with
gcloud dataproc clusters create ... \
--properties yarn:yarn.nodemanager.remote-app-log-dir=<gcs-dir>
or update the cluster's temp bucket with
gcloud dataproc clusters create ... --temp-bucket <bucket>
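Once the job finishes, you can inspect the aggregated container logs directly in that directory; a quick sketch, with the bucket, cluster UUID and application ID as placeholders:
# List the aggregated YARN container logs in the cluster's temp bucket
gsutil ls gs://<cluster-tmp-bucket>/<cluster-uuid>/yarn-logs/
# Or fetch them via YARN from a cluster node
yarn logs -applicationId <application-id>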
Note that if your Spark job runs in client mode (the default), the Spark driver runs on the master node instead of in YARN, and the driver logs are stored at the location given by the Dataproc-generated job property driverOutputResourceUri, which is a job-specific folder in the cluster's staging bucket. In cluster mode, the Spark driver runs in YARN, so the driver logs are YARN container logs and are aggregated as described above.
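For example, you can look up a job's driver output location and read it with gcloud and gsutil; the job ID and region below are placeholders, and the trailing wildcard is there because the output may be split into several files:
# Show the job's metadata, including driverOutputResourceUri
gcloud dataproc jobs describe <job-id> --region=<region>
# Read the driver output files under that URI
gsutil cat '<driverOutputResourceUri>*'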
If you want to disable Cloud Logging for your cluster, set dataproc:dataproc.logging.stackdriver.enable=false. But note that this will disable all types of Cloud Logging logs, including YARN container logs, startup logs and service logs.
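To set that property at cluster creation, follow the same pattern as the commands above:
gcloud dataproc clusters create ... \
--properties dataproc:dataproc.logging.stackdriver.enable=false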