I currently have a system from which I want to send data to BigQuery via a Google Cloud Function (using Python). My idea was to do this in two simple steps:
- The system calls a Google Cloud Function and sends JSON data to that function
- The Google Cloud Function unpacks the data and stores it in a Google BigQuery table using the pandas-gbq package (a rough sketch of this step is below).
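A minimal sketch of that second step, assuming an HTTP-triggered function; the project, dataset, and table names are placeholders:

```python
import pandas as pd
import pandas_gbq

def ingest(request):
    """HTTP Cloud Function: writes the request's JSON payload to BigQuery."""
    payload = request.get_json()
    df = pd.DataFrame([payload])
    # "my-project" and "my_dataset.my_table" are placeholders; to_gbq
    # appends the dataframe to the destination table.
    pandas_gbq.to_gbq(df, "my_dataset.my_table",
                      project_id="my-project", if_exists="append")
    return "ok", 200
```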
I thought this was a good idea, until I realized that BigQuery has rate limits on adding data (see https://community.microstrategy.com/s/article/Known-Limitations-of-Adding-Data-into-Google-BigQuery?language=en_US for example).
Because this system will trigger more than a thousand times a day, I think I need a better setup.
What would you recommend I do?
I already thought of two things:
- Using Google Datastore for adding data, but I am not sure whether it has the same kind of rate limits, or whether I can easily transfer that data to BigQuery afterwards.
- Instead of sending data on a case-by-case basis, collecting (for example) 500 calls to a Google Cloud Function and, once the data for 500 cases has been received, sending those 500 cases to BigQuery in one go. I am not sure whether this will work, or whether it is even possible with Google Cloud Functions.
CodePudding user response:
The limits you mention only apply to load jobs.
You can use streaming inserts instead, which do not have those limits.
I am not sure if pandas-gbq uses load jobs or streaming inserts under the hood.
In any case, you can use the BigQuery Python client and its insert_rows_from_dataframe method, which uses the streaming API.
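A minimal sketch of that approach; get_table and insert_rows_from_dataframe are real google-cloud-bigquery methods, while the table id and the wrapper function are placeholders:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

def store_rows(df: pd.DataFrame) -> None:
    # Fetch the table first so the client knows the schema it needs
    # to serialize the dataframe rows. The table id is a placeholder.
    table = client.get_table("my-project.my_dataset.my_table")
    # insert_rows_from_dataframe uses the streaming API (tabledata.insertAll)
    # rather than a load job, so the load-job limits do not apply.
    errors = client.insert_rows_from_dataframe(table, df)
    # The method returns one list of per-row errors per chunk of rows.
    for chunk_errors in errors:
        if chunk_errors:
            raise RuntimeError(f"Streaming insert failed: {chunk_errors}")
```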
CodePudding user response:
For your requirement, you can use the Google-provided Dataflow templates that stream data from Pub/Sub into BigQuery. A streaming pipeline reads the JSON-formatted data from Cloud Pub/Sub and writes it to BigQuery.
You can choose either the Pub/Sub Topic to BigQuery template or the Pub/Sub Subscription to BigQuery template, depending on your requirements. Streaming ingestion into BigQuery will incur some cost, which you can check in the BigQuery pricing documentation.
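With this setup, the Cloud Function only needs to publish the incoming JSON to a Pub/Sub topic, and the template takes care of the BigQuery write. A minimal sketch, with placeholder project and topic names:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic; the Dataflow template reads from this
# topic (or a subscription on it) and writes the rows to BigQuery.
topic_path = publisher.topic_path("my-project", "my-topic")

def ingest(request):
    """HTTP Cloud Function: forwards the request's JSON payload to Pub/Sub."""
    payload = request.get_json()
    # The Pub/Sub-to-BigQuery templates expect JSON messages whose fields
    # match the destination table's schema.
    future = publisher.publish(topic_path, json.dumps(payload).encode("utf-8"))
    future.result()  # block until the publish succeeds
    return "ok", 200
```

Pub/Sub also decouples the function from BigQuery, so bursts well above a thousand calls a day should not be a problem.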