I know how to download a file from Cloud Storage within a Cloud Run instance, but I can't find the syntax for reading that file in Python. I'm looking to immediately convert the CSV file into a pandas DataFrame, just by using pd.read_csv('testing.csv'). So my personal code looks like

download_blob(bucket_name, source_blob_name, 'testing.csv')

Shouldn't I then be able to do pd.read_csv('testing.csv') within the Cloud Run instance? When doing it this way, I keep getting an internal server error when loading the page. It seems like a simple question, but I haven't been able to find an example of it anywhere. Everything just downloads the file; I never see the file actually used.
from google.cloud import storage


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your GCS object
    # source_blob_name = "storage-object-name"
    # The path to which the file should be downloaded
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    # Construct a client-side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Downloaded storage object {} from bucket {} to local file {}.".format(
            source_blob_name, bucket_name, destination_file_name
        )
    )
CodePudding user response:
Using a relative filename such as 'testing.csv' means the file is written to the current working directory, and what the current directory is on Cloud Run is not well defined. Instead, specify an absolute path to a known directory location.
Download to the /tmp/ directory, e.g. '/tmp/testing.csv'. Note that using file system space consumes memory, because the file system is RAM-based, so make sure the Cloud Run instance has enough memory. A sketch follows the excerpt below.
Excerpt from the Cloud Run Container Runtime Contract:
The filesystem of your container is writable and is subject to the following behavior:
- This is an in-memory filesystem, so writing to it uses the container instance's memory.
- Data written to the filesystem does not persist when the container instance is stopped.
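A minimal sketch of this approach, reusing the download_blob helper above (bucket_name and source_blob_name stand in for your real values):

import pandas as pd

# Download to an absolute path under /tmp/, the writable (in-memory)
# filesystem on Cloud Run, then read the CSV back from that same path.
download_blob(bucket_name, source_blob_name, '/tmp/testing.csv')
df = pd.read_csv('/tmp/testing.csv')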
CodePudding user response:
download_as_bytes is the function you're looking for if you want to load the object directly into memory.
from io import BytesIO

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(source_blob_name)

# download_as_bytes returns bytes, so wrap them in BytesIO
# (not StringIO, which expects str) before passing them to pandas.
data = blob.download_as_bytes()
df = pd.read_csv(BytesIO(data))
Pandas also supports reading directly from Google Cloud Storage. https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file.
So something like "gs://bucket/file"
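A minimal sketch of that approach; note that pandas delegates gs:// URLs to the gcsfs package, so gcsfs must be installed in the container image:

import pandas as pd

# No explicit storage client is needed: pandas hands the gs:// URL to
# gcsfs/fsspec. bucket_name and source_blob_name are placeholder names.
df = pd.read_csv(f"gs://{bucket_name}/{source_blob_name}")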