get list of gsutil URI inside a specific bucket folder to iterate through


I'm very new to GCS and using it with Python. I have a GCS bucket called "my_data" that contains many folders. I'm interested in a folder called "ABC" and a subfolder inside it called "WW3". I want to get a list of the gsutil URIs (not the blobs) inside that specific folder, so I can open the files as pandas DataFrames and concatenate them.

So far I have been able to get a list of blobs like this (I used this post and this video to do that):

my_bucket = storage_client.get_bucket("my_data")

# Get blobs in a specific subdirectory
blobs_specific = list(my_bucket.list_blobs(prefix='ABC/WW3/'))

>>>
# printing blobs_specific gives me the blobs as a list like this:
[<Blob: my_data, ABC/S3/, 12231543135681432>, ...]
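For reference, each Blob object exposes `.name` (the full object path) and `.bucket.name`, so a gs:// URI can be built without parsing the blob's string representation. A minimal sketch (the `SimpleNamespace` stand-in is hypothetical; a real `google.cloud.storage.Blob` has the same two attributes):

```python
from types import SimpleNamespace

def blob_to_gs_uri(blob):
    """Build a gs:// URI from a blob's bucket name and object name."""
    return f"gs://{blob.bucket.name}/{blob.name}"

# Hypothetical stand-in for a real google.cloud.storage.Blob,
# which exposes the same .name and .bucket.name attributes.
fake_blob = SimpleNamespace(
    name="ABC/WW3/tab1.csv",
    bucket=SimpleNamespace(name="my_data"),
)
print(blob_to_gs_uri(fake_blob))  # gs://my_data/ABC/WW3/tab1.csv
```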

I would like to get a list of URIs that looks like this:

["gs://my_data/ABC/WW3/tab1.csv","gs://my_data/ABC/WW3/tab2.csv","gs://my_data/ABC/WW3/tab3.csv"...]

So I can later open them with pandas and concatenate them.

Is there a way I can get the list of URIs instead of the blobs?

Or, alternatively, can I somehow use the blobs themselves to read the CSVs as pandas DataFrames and concatenate them?
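For that second option, a minimal sketch assuming `google-cloud-storage`'s `Blob.download_as_bytes()` method (available in recent versions of the library) and that every listed blob is a CSV file:

```python
import io
import pandas as pd

def concat_csv_blobs(blobs):
    """Download each CSV blob into memory and concatenate into one DataFrame."""
    frames = [pd.read_csv(io.BytesIO(blob.download_as_bytes())) for blob in blobs]
    return pd.concat(frames, ignore_index=True)
```

With something like this, `concat_csv_blobs(blobs_specific)` would avoid the need for gs:// URIs entirely; note that zero-byte "folder" placeholder blobs would need to be filtered out first.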

Edit: I have tried to solve it by splitting the blob's string representation and building the URI from the pieces. It seems to create a list of URIs, but it doesn't quite do what it looks like it does, and it is not very smart:

urls = []

for x, y in enumerate(blobs_specific):
    first_part = "gs://my_data/WW3/"
    scnd_part = str(blobs_specific[x]).split(',')[1]

    url = first_part + scnd_part

    urls.append(url)

However, when I try to iterate over this list it fails, and it seems to print a different URI than what it saved:

urls[1]
>>>'gs://my_data/WW3/ ABC/tab1.csv'

# it seems there is a space between the / and "ABC", and when I try to read it with pandas I get a path-not-found error:

file_path = urls[1]

df = pd.read_csv(file_path,
                 sep=",",
                 storage_options={"token": "my_secret_token-20g8g632vsk1.json"})

>>>
# this is slightly edited from the original because I couldn't paste the real name, but it shows the b and o and weird characters that don't appear when I print the path...
FileNotFoundError: b/my_data/o/ WW3*BC/S1/ABCtab1.csv

CodePudding user response:

I have found a solution for this by using .lstrip(); however, if someone has a smarter solution I would like to learn :)

urls = []

for x, y in enumerate(blobs_specific):
    first_part = "gs://my_data/WW3/"
    scnd_part = str(blobs_specific[x]).split(',')[1].lstrip()

    url = first_part + scnd_part

    urls.append(url)
  • It might be that the gs://my_data part will be a bit different in your case; make sure you take the right path.
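A possibly cleaner alternative sketch: build the URIs from each blob's `.name` attribute instead of parsing `str(blob)`, skipping zero-length "directory" placeholder objects. The `list_gs_uris` helper is hypothetical; it assumes the same bucket handle as above:

```python
def list_gs_uris(bucket, bucket_name, prefix):
    """Return gs:// URIs for all objects under prefix, skipping folder placeholders."""
    return [
        f"gs://{bucket_name}/{blob.name}"
        for blob in bucket.list_blobs(prefix=prefix)
        if not blob.name.endswith("/")
    ]

# usage against a real bucket (untested sketch):
# urls = list_gs_uris(my_bucket, "my_data", "ABC/WW3/")
```

Because `blob.name` already contains the full `ABC/WW3/...` path, there is no leading-space or split-on-comma issue to work around.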