I know how to download a file from Cloud Storage within a Cloud Run instance, but I can't find the syntax for reading that file in Python. I'm looking to immediately convert the CSV file into a pandas DataFrame, just by using pd.read_csv('testing.csv'). So my personal code looks like

download_blob(bucket_name, source_blob_name, 'testing.csv')

Shouldn't I then be able to do pd.read_csv('testing.csv') within the Cloud Run instance? When doing it this way, I keep getting an internal server error when loading the page. It seems like a simple question, but I haven't been able to find an example of it anywhere. Everything just downloads the file; I never see the file actually used.
from google.cloud import storage


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your GCS object
    # source_blob_name = "storage-object-name"
    # The path to which the file should be downloaded
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    # Construct a client-side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Downloaded storage object {} from bucket {} to local file {}.".format(
            source_blob_name, bucket_name, destination_file_name
        )
    )
CodePudding user response:
Using a relative filename such as 'testing.csv' means the file is written to the current working directory, and what the current directory is on Cloud Run is not well defined. Instead, specify an absolute path to a known directory location.
Download to the /tmp/ directory, e.g. '/tmp/testing.csv'. Note that using file system space consumes memory, because the file system is RAM-based, so make sure the Cloud Run instance has enough memory. A sketch follows the excerpt below.
Excerpt from the Cloud Run Container Runtime Contract:
The filesystem of your container is writable and is subject to the following behavior:
- This is an in-memory filesystem, so writing to it uses the container instance's memory.
- Data written to the filesystem does not persist when the container instance is stopped.
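A minimal sketch of this approach, reusing the download_blob helper above (bucket_name and source_blob_name stand in for your real values):

import pandas as pd

# Download to an absolute path under /tmp/, the writable (in-memory)
# filesystem on Cloud Run, then read the CSV back from that same path.
download_blob(bucket_name, source_blob_name, '/tmp/testing.csv')
df = pd.read_csv('/tmp/testing.csv')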
CodePudding user response:
download_as_bytes is the function you're looking for if you want to load the object directly into memory.
from io import BytesIO

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(source_blob_name)

# download_as_bytes returns bytes, so wrap them in BytesIO
# (not StringIO, which expects str) before passing them to pandas.
data = blob.download_as_bytes()
df = pd.read_csv(BytesIO(data))
Pandas also supports reading directly from Google Cloud Storage. https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file.
So something like "gs://bucket/file"
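A minimal sketch of that approach; note that pandas delegates gs:// URLs to the gcsfs package, so gcsfs must be installed in the container image:

import pandas as pd

# No explicit storage client is needed: pandas hands the gs:// URL to
# gcsfs/fsspec. bucket_name and source_blob_name are placeholder names.
df = pd.read_csv(f"gs://{bucket_name}/{source_blob_name}")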