In PySpark, is there a way to pass credentials as variables into spark.read?

Spark allows us to read directly from Google BigQuery, as shown below:

df = spark.read.format("bigquery") \
  .option("credentialsFile", "googleKey.json") \
  .option("parentProject", "projectId") \
  .option("table", "project.table") \
  .load()

However, keeping the key saved on the virtual machine isn't a great idea. I have the Google key stored securely as JSON in a credential management tool. The key is read on demand and saved into a variable called googleKey.

Is it possible to pass the JSON into spark.read, or to pass in the credentials as a dictionary?

CodePudding user response：

The other option is credentials. From the spark-bigquery-connector docs:

How do I authenticate outside GCE / Dataproc?

Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration. They should be passed in as a base64-encoded string directly.
// Globally
spark.conf.set("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
// Per read/Write
spark.read.format("bigquery").option("credentials", "<SERVICE_ACCOUNT_JSON_IN_BASE64>")
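
In PySpark, that could look roughly like the sketch below. It assumes googleKey holds the service-account JSON either as a string or as a dict; the helper names encode_service_account_key and read_bigquery_table are made up for illustration and are not part of the connector. Only the credentials, parentProject and table options come from the snippets above.

```python
import base64
import json


def encode_service_account_key(google_key):
    """Return the service-account key as a base64-encoded string.

    Accepts either the raw JSON text (str) or an already-parsed dict.
    (Hypothetical helper, not part of the connector.)
    """
    if isinstance(google_key, dict):
        google_key = json.dumps(google_key)
    return base64.b64encode(google_key.encode("utf-8")).decode("ascii")


def read_bigquery_table(spark, google_key, parent_project, table):
    """Read a BigQuery table using in-memory credentials (no key file on disk).

    `spark` is an existing SparkSession with the BigQuery connector available.
    """
    return (
        spark.read.format("bigquery")
        .option("credentials", encode_service_account_key(google_key))
        .option("parentProject", parent_project)
        .option("table", table)
        .load()
    )
```

Usage would then be something like df = read_bigquery_table(spark, googleKey, "projectId", "project.table"), with the key never written to the VM's disk.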

CodePudding user response：

This is more of a chicken-and-egg situation. If you are storing the credential file in a secret manager (I hope that's not your credential management tool), how would you access the secret manager? For that you might need a key, and where would you store that key?

For this, Azure has created managed identities, through which two different services can talk to each other without explicitly providing any keys (credentials).