Summary: I'm unable to write a Spark DataFrame to a BigQuery table.
Detail: I'm running a serverless batch job in Dataproc, reading data directly from BigQuery / Cloud Storage, doing some feature engineering, and persisting the result to BigQuery. However, in one routine I'm unable to write the DataFrame, and the job reports:
22/10/08 08:13:21 WARN BigQueryDataSourceWriterInsertableRelation: It seems that 184 out of 16 partitions have failed, aborting
22/10/08 08:13:21 WARN BigQueryDirectDataSourceWriterContext: BigQuery Data Source writer aedb4dc8-28c5-4118-9dcc-de2ef689e75c aborted
I'm not quite sure, but it seems like my Spark configuration can't handle the process. My config is:
--properties spark.executor.instances=10,spark.driver.cores=16,spark.executor.cores=16
My writing code:
user_item_interaction_df.write.format("bigquery").option("writeMethod", "direct").mode("overwrite").save()
BTW, I'm running some joins and cross joins, which I know are expensive. The final output table should have something like 33M rows.
Any idea is welcome!
CodePudding user response:
Judging from the error (see the GH issue for more details):

com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Exceeds 'AppendRows throughput' quota, user_id: ... (status: INSUFFICIENT_TOKENS), you can issue a raise quota request through Google Cloud Console. Be sure to include this full error message in the request description. Entity: projects/...

it seems that your Spark job is hitting BigQuery too hard.
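For context, writeMethod=direct means each Spark task streams its rows through the BigQuery Storage Write API, and those AppendRows calls are exactly what the quota message refers to, so write throughput scales with the number of concurrent executors. A minimal sketch of that write path (the table name is a placeholder, not taken from the original post):

# Each task opens its own Storage Write API stream, so total
# AppendRows throughput grows with the number of concurrent executors.
# "my_project.my_dataset.user_item_interactions" is a placeholder table name.
(user_item_interaction_df.write
    .format("bigquery")
    .option("writeMethod", "direct")  # direct = BigQuery Storage Write API (AppendRows)
    .mode("overwrite")
    .save("my_project.my_dataset.user_item_interactions"))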
You can try two things to fix this:
- Request more quota as suggested in the exception
- Limit the scale of the job by decreasing the maximum number of executors, for example with the spark.dynamicAllocation.maxExecutors=100 property, as sketched below. Note that the default value for this property is 1000.
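For example, assuming the batch is still submitted with a --properties flag like the one in the question (the 100 below is only an illustrative starting point, not a tuned value), the cap can simply be appended to the existing list:

--properties spark.executor.instances=10,spark.driver.cores=16,spark.executor.cores=16,spark.dynamicAllocation.maxExecutors=100

Fewer concurrent executors means fewer simultaneous AppendRows streams, which keeps the write under the throughput quota at the cost of a longer-running job.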