Is there a way to automate this Python script in GCP?


I am a complete beginner with GCP functions/products. I have written the code below, which takes a list of cities from a local file, calls in weather data for each city in that list, and eventually uploads those weather values into a table in BigQuery. I don't need to change the code anymore, as it creates new tables when a new week begins. Now I would like to "deploy" it (I am not even sure if "deploying" is the right term) in the cloud so that it runs there automatically. I tried using App Engine and Cloud Functions but faced issues in both places.

import requests, json, sqlite3, os, csv, datetime, re
from google.cloud import bigquery
#from google.cloud import storage

list_city = []
with open("list_of_cities.txt", "r") as pointer:
    for line in pointer:
        list_city.append(line.strip())

API_key = "PLACEHOLDER"
Base_URL = "http://api.weatherapi.com/v1/history.json?key="

yday = datetime.date.today() - datetime.timedelta(days = 1)
Date = yday.strftime("%Y-%m-%d")

table_id = f"sonic-cat-315013.weather_data.Historical_Weather_{yday.isocalendar()[0]}_{yday.isocalendar()[1]}"

credentials_path = r"PATH_TO_JSON_FILE"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path

client = bigquery.Client()

try:
    schema = [
        bigquery.SchemaField("city", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Date", "Date", mode="REQUIRED"),
        bigquery.SchemaField("Hour", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Temperature", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Humidity", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Condition", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Chance_of_rain", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Precipitation_mm", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Cloud_coverage", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Visibility_km", "FLOAT", mode="REQUIRED")
    ]


    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="Date",  # name of column to use for partitioning
    )
    table = client.create_table(table)  # Make an API request.
    print(
        "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
    )
except:
    print("Table {}_{} already exists".format(yday.isocalendar()[0], yday.isocalendar()[1]))

    
def get_weather():
    try:
        x["location"]
    except:
        print(f"API could not call city {city_name}")
        
    global day, time, dailytemp, dailyhum, dailycond, chance_rain, Precipitation, Cloud_coverage, Visibility_km    
    
    day = []
    time = []
    dailytemp = []
    dailyhum = []
    dailycond = []
    chance_rain = []
    Precipitation = []
    Cloud_coverage = []
    Visibility_km = []
    
    for i in range(24):
        dayval = re.search("^\S*\s" ,x["forecast"]["forecastday"][0]["hour"][i]["time"])
        timeval = re.search("\s(.*)" ,x["forecast"]["forecastday"][0]["hour"][i]["time"])
       
        day.append(dayval.group()[:-1])
        time.append(timeval.group()[1:])
        dailytemp.append(x["forecast"]["forecastday"][0]["hour"][i]["temp_c"])
        dailyhum.append(x["forecast"]["forecastday"][0]["hour"][i]["humidity"])
        dailycond.append(x["forecast"]["forecastday"][0]["hour"][i]["condition"]["text"])
        chance_rain.append(x["forecast"]["forecastday"][0]["hour"][i]["chance_of_rain"])
        Precipitation.append(x["forecast"]["forecastday"][0]["hour"][i]["precip_mm"])
        Cloud_coverage.append(x["forecast"]["forecastday"][0]["hour"][i]["cloud"])
        Visibility_km.append(x["forecast"]["forecastday"][0]["hour"][i]["vis_km"])
    for i in range(len(time)):
        time[i] = int(time[i][:2])

def main():
    i = 0
    while i < len(list_city):
        try:
            global city_name
            city_name = list_city[i]
            complete_URL = Base_URL + API_key + "&q=" + city_name + "&dt=" + Date
            response = requests.get(complete_URL, timeout = 10)
            global x
            x = response.json()

            get_weather()
            table = client.get_table(table_id)
            varlist = []
            for j in range(24):
                variables = city_name, day[j], time[j], dailytemp[j], dailyhum[j], dailycond[j], chance_rain[j], Precipitation[j], Cloud_coverage[j], Visibility_km[j]
                varlist.append(variables)
                
            client.insert_rows(table, varlist)
            print(f"City {city_name}, ({i 1} out of {len(list_city)}) successfully inserted")
            i  = 1
        except Exception as e:
            print(e)
            continue

In the code, there are direct references to two files that are located locally: one is the list of cities and the other is the JSON file containing the credentials to access my project in GCP. I believed that uploading these files to Cloud Storage and referencing them there wouldn't be an issue, but then I realised that I can't actually access my buckets in Cloud Storage without using the credentials file.

This leaves me unsure whether the entire process is possible at all: how do I authenticate from the cloud in the first place, if I need to reference the credentials file locally first? It seems like an endless circle, where I'd authenticate with the file in Cloud Storage, but I'd need authentication first to access that file.

I'd really appreciate some help here. I have no idea where to go from this, and I don't have a strong background in SE/CS; I only know Python, R, and SQL.

CodePudding user response:

There are different flavors and options for deploying your application, and the right one will depend on your application's semantics and execution constraints.

It would be too much to cover all of them here, and the official Google Cloud Platform documentation covers each of them in great detail:

  • Google Compute Engine
  • Google Kubernetes Engine
  • Google App Engine
  • Google Cloud Functions
  • Google Cloud Run

Based on my understanding of your application design, the most suitable ones would be:

  • Google App Engine
  • Google Cloud Functions
  • Google Cloud Run: check these criteria to see if your application is a good fit for this deployment style

I would suggest using Cloud Functions as your deployment option, in which case your application will default to using the project's App Engine service account to authenticate itself and perform the allowed actions. Hence, you only need to check that the default account PROJECT_ID@appspot.gserviceaccount.com, under the IAM configuration section, has proper access to the needed APIs (BigQuery in your case).
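
As a rough illustration (not official boilerplate), an HTTP-triggered Cloud Function's main.py could look something like the sketch below; the function name run_weather_job and the overall structure are my own placeholders, and the key point is simply that bigquery.Client() picks up the function's default service account without a key file or an os.environ call.

from google.cloud import bigquery

def run_weather_job(request):
    # Entry point for an HTTP-triggered Cloud Function.
    # The client authenticates with the function's default service account,
    # so there is no credentials file and no os.environ setup here.
    client = bigquery.Client()
    # ... create/get the weekly table, fetch the weather data for each city,
    # and insert the rows, reusing the logic from the script above ...
    return "Weather load finished", 200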

In such a setup, you won't need to push your service account key to Cloud Storage (which I would recommend avoiding in any case), and you won't need to pull it either, as the runtime will handle authenticating the function for you.
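
If you also move list_of_cities.txt into a Cloud Storage bucket, the same default credentials let the function read it; here is a minimal sketch, with the bucket name as a placeholder. As with BigQuery, the service account needs read access to that bucket under IAM.

from google.cloud import storage

def load_city_list(bucket_name="YOUR_BUCKET_NAME", blob_name="list_of_cities.txt"):
    # The storage client also authenticates with the function's default
    # service account, so no credentials file is needed here either.
    storage_client = storage.Client()
    text = storage_client.bucket(bucket_name).blob(blob_name).download_as_text()
    return [line.strip() for line in text.splitlines() if line.strip()]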

CodePudding user response:

For Cloud Functions, the deployed function will run with the project service account credentials by default, without needing a separate credentials file. Just make sure this service account is granted access to whatever resources it will be trying to access.

You can read more info about this approach here (along with options for using a different service account if you desire): https://cloud.google.com/functions/docs/securing/function-identity

This approach is very easy, and keeps you from having to deal with a credentials file at all on the server. Note that you should remove the os.environ line, as it's unneeded. The BigQuery client will use the default credentials as noted above.

If you want the code to run the same whether on your local machine or deployed to the cloud, simply set a "GOOGLE_APPLICATION_CREDENTIALS" environment variable permanently in the OS on your machine. This is similar to what you're doing in the code you posted; however, you're temporarily setting it every time using os.environ rather than permanently setting the environment variable on your machine. The os.environ call only sets that environment variable for that one process execution.
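
For example, once the variable is set permanently at the OS level, the script itself no longer needs to touch os.environ at all; a quick sketch of what that leaves you with:

import os
from google.cloud import bigquery

# Set once in the OS (e.g. in your shell profile), not in the script:
print(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))  # path to your key file, or None
client = bigquery.Client()  # uses that key locally, and the default service account in the cloud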

If for some reason you don't want to use the default service account approach outlined above, you can instead reference the credentials file directly when you instantiate the bigquery.Client():

https://cloud.google.com/bigquery/docs/authentication/service-account-file

You just need to package the credential file with your code (i.e. in the same folder as your main.py file), and deploy it alongside so it's in the execution environment. In that case, it is referenceable/loadable from your script without needing any special permissions or credentials. Just provide the relative path to the file (i.e. assuming you have it in the same directory as your python script, just reference only the filename)
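
A short sketch of that option, assuming the key file is deployed next to main.py under the placeholder name "credentials.json":

from google.cloud import bigquery

# Build the client straight from the packaged key file (relative path),
# instead of relying on Application Default Credentials.
client = bigquery.Client.from_service_account_json("credentials.json")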
