What, essentially, is DataProcSparkOperator? I have found plenty of information and code snippets that use it, but I still cannot find a solid definition of it.
CodePudding user response:
I think what you are referring to is the Apache Airflow operator for submitting Spark jobs to a Dataproc cluster. Check the Airflow docs, this introductory article, and this example code.
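To make that concrete, here is a minimal sketch of such a task, assuming an Airflow 1.x environment and an already-running Dataproc cluster; the jar path, cluster name, and region below are hypothetical placeholders:

from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

# Minimal Spark job submission to an existing Dataproc cluster.
# All values below are placeholders for illustration.
spark_task = DataProcSparkOperator(
    task_id='run_spark_job',
    main_jar='gs://my-bucket/jars/my-spark-job.jar',  # hypothetical jar location
    cluster_name='my-dataproc-cluster',               # hypothetical cluster name
    region='us-central1',                             # hypothetical region
    dag=dag,
)

Note that DataProcSparkOperator only submits a job; it does not create the cluster. Cluster lifecycle is handled by separate operators such as DataprocClusterCreateOperator.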
CodePudding user response:
Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. You can see more in this documentation.
You can find more documentation about the different Dataproc operators here.
Airflow provides DataProcSparkOperator to submit Spark jobs to your Dataproc cluster.
Here is an example:
# Airflow 1.x (contrib) import path for the operator.
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

submit_job = DataProcSparkOperator(
    task_id='submit_job',
    # Jars to add to the Spark classpath; the location comes from an Airflow Variable.
    dataproc_spark_jars=['{{var.value.spark_bq_jar}}'],
    # Entry point inside the jar.
    main_class='LoadData',
    # Arguments passed straight through to the jar; the key==value format here is
    # simply whatever the jar's own argument parser expects.
    arguments=[
        "job_name==currency",
        "data_type=={{params.thirty_raw_folder_prefix}}",
        "input_path==gs://input-bucket/input-folder",
        "output_path==gs://staging-bucket/staging_folder",
        "week=={{dag_run.conf['week']}}",
        "year=={{dag_run.conf['year']}}",
        "genres=={{dag_run.conf['genres']}}"
    ],
    # Extra files shipped to the cluster alongside the job.
    files=['gs://bucket/folder/properties/loaddata.properties'],
    # Name of the existing Dataproc cluster to submit the job to.
    cluster_name='{{params.cluster_name}}',
    dag=dag
)
Here, the spark_bq_jar Airflow Variable contains the location of your Spark jar, and the arguments tell the jar which job to run. You can see more examples in this link.
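For completeness, here is a minimal sketch of where those templated values come from, assuming an Airflow 1.x setup; the DAG id, param values, jar path, and conf payload are hypothetical placeholders:

from datetime import datetime
from airflow import DAG

# Hypothetical DAG definition supplying the values the templates above read:
# {{params.cluster_name}} and {{params.thirty_raw_folder_prefix}} come from
# `params`, while {{dag_run.conf[...]}} comes from the trigger-time payload.
dag = DAG(
    dag_id='load_data',                 # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,             # run only when triggered manually
    params={
        'cluster_name': 'my-dataproc-cluster',     # hypothetical cluster name
        'thirty_raw_folder_prefix': 'thirty_raw',  # hypothetical folder prefix
    },
)

# Set the Variable read by {{var.value.spark_bq_jar}} and trigger the DAG with
# a conf payload (Airflow 1.x-style CLI; the jar path is a placeholder):
#   airflow variables --set spark_bq_jar gs://my-bucket/jars/load-data.jar
#   airflow trigger_dag load_data --conf '{"week": "12", "year": "2022", "genres": "pop"}'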