I am experimenting with GCP. I have a local environment with Hadoop. It consists of files stored on HDFS and a bunch of Python scripts that make API calls and trigger Pig jobs. These Python jobs are scheduled via cron.
I want to understand the best way to do something similar in GCP. I understand that I can use GCS as an HDFS replacement, and that Dataproc can be used to spin up Hadoop clusters and run Pig jobs.
Is it possible to store these Python scripts in GCS, spin up Hadoop clusters on a cron-like schedule, and point those clusters at the Python scripts in GCS to run them?
CodePudding user response:
If you are looking for a cron job or workflow scheduler on GCP, consider:
Cloud Scheduler, which is a fully managed, enterprise-grade cron job scheduler;
Cloud Workflows, which combines Google Cloud services and APIs to build reliable applications, process automation, and data and machine learning pipelines;
Cloud Composer, which is a fully managed workflow orchestration service built on Apache Airflow.
Cloud Scheduler is the simplest of the three and might be the best fit for your use case. Cloud Workflows has some overlap with Cloud Composer; see their key differences and how to choose between them in this doc.
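As a concrete sketch of the Cloud Scheduler route (every name below is a hypothetical placeholder: the project my-project, the template pig-workflow, the bucket gs://my-bucket, and the service account): you can define a Dataproc workflow template that creates a transient, template-managed cluster, runs a Pig script stored in GCS, and deletes the cluster when the job finishes, then have Cloud Scheduler instantiate that template on a cron schedule via the Dataproc REST API:
# Define a workflow template with a transient, template-managed cluster.
gcloud dataproc workflow-templates create pig-workflow --region=us-central1
gcloud dataproc workflow-templates set-managed-cluster pig-workflow \
    --region=us-central1 --cluster-name=transient-pig-cluster
# Add the Pig job; the script itself lives in GCS.
gcloud dataproc workflow-templates add-job pig --step-id=run-pig \
    --workflow-template=pig-workflow --region=us-central1 \
    --file=gs://my-bucket/scripts/job.pig
# Have Cloud Scheduler instantiate the template every night at 02:00.
# The service account must be allowed to run Dataproc workflows.
gcloud scheduler jobs create http nightly-pig-run \
    --location=us-central1 \
    --schedule="0 2 * * *" \
    --http-method=POST \
    --uri="https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/pig-workflow:instantiate" \
    --oauth-service-account-email=scheduler-sa@my-project.iam.gserviceaccount.com
The workflow template handles cluster creation and teardown for you, which mirrors the cron-plus-cluster setup in the question; Cloud Composer covers the same ground with an Airflow DAG and the Dataproc operators if you need more complex dependencies between jobs.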
CodePudding user response:
I discovered that you can use Dataproc to run Python scripts through a 'submit pig' job. Pig can run filesystem and shell commands (fs and sh), so the job can copy a Bash script from GCS onto the cluster and execute it, and that Bash script can in turn call your Python scripts:
gcloud dataproc jobs submit pig --cluster=test-python-exec --region=us-central1 \
    -e='fs -cp -f gs://testing_dataproc/main/execution/execute_python.sh file:///tmp/execute_python.sh; sh chmod 750 /tmp/execute_python.sh; sh /tmp/execute_python.sh'
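For reference, here is a minimal sketch of what execute_python.sh could contain, assuming the Python script sits next to it under the same GCS path (the file name my_job.py is a hypothetical placeholder):
#!/bin/bash
# Hypothetical helper run on a Dataproc node: fetch the Python script from GCS and execute it.
set -euo pipefail
# gsutil is preinstalled on Dataproc nodes, so it can pull the script straight from GCS.
gsutil cp gs://testing_dataproc/main/execution/my_job.py /tmp/my_job.py
# Run the script with the cluster's Python interpreter.
python /tmp/my_job.py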