I am working on a Python application that searches a blob container for blobs whose names contain a given keyword. My search code is below. On very large containers this approach is too slow: searching a container with ~1,100,000 blobs takes over 20 minutes. In addition, my application 'freezes' and cannot be clicked until the search is finished.
I recently started reading about multi-threading and started thinking about how it could be used in my application to speed up the search. Since my current search uses a single thread, would it be possible to use multiple threads to complete it?
An idea I currently have is to somehow get the total count of blobs that the generator holds, and assign one half of it to one thread to search, and assign the other half to another thread to search. So in the end, multiple threads would be performing the search to ultimately complete the entire search faster. Any ideas, tips or recommendations would be most helpful.
next_marker = None
while True:
    # Request the next batch of blobs, resuming from the last continuation token
    generator = container_client.list_blobs(marker=next_marker)
    for item in generator:
        if search_keyword in item.name:
            print("Container: {0}, Blob: {1}\n".format(container_client.container_name, item.name))
    # Read the continuation token first, then stop once there are no more results
    next_marker = generator.next_marker
    if not next_marker:
        break
CodePudding user response:
Not a perfect solution, but one possible way to parallelize the listing operation is to make use of blob prefixes. Assuming your blob names start with alphanumeric characters (a-z, A-Z and 0-9), you could run prefix searches in parallel, where each thread lists only the blobs whose names start with a certain prefix ("a", "b", and so on).
You would use list_blobs with the name_starts_with parameter and provide the prefix there.
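A minimal sketch of the prefix idea, assuming a single container client can be shared across threads and that every blob name starts with one of the listed prefixes (blobs starting with any other character, such as "_", would be missed). The names find_matches_for_prefix, parallel_search, DEFAULT_PREFIXES and max_workers are made up for this illustration; only list_blobs and name_starts_with come from the SDK.

```python
import string
from concurrent.futures import ThreadPoolExecutor

# Assumption: blob names start with a-z, A-Z or 0-9. Prefix matching is
# case-sensitive, so upper and lower case letters are separate prefixes.
DEFAULT_PREFIXES = tuple(string.ascii_letters + string.digits)


def find_matches_for_prefix(container_client, prefix, search_keyword):
    """List blobs whose names start with `prefix` and keep those containing the keyword."""
    matches = []
    for blob in container_client.list_blobs(name_starts_with=prefix):
        if search_keyword in blob.name:
            matches.append(blob.name)
    return matches


def parallel_search(container_client, search_keyword,
                    prefixes=DEFAULT_PREFIXES, max_workers=16):
    """Run one prefix listing per task on a thread pool and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_prefix = pool.map(
            lambda prefix: find_matches_for_prefix(
                container_client, prefix, search_keyword),
            prefixes,
        )
        # Flatten the per-prefix lists into a single list of blob names
        return [name for names in per_prefix for name in names]
```

You would call it as parallel_search(container_client, search_keyword). The speed-up depends on how evenly your blob names are spread across the prefixes: if most names start with the same character, one thread still does most of the work, and longer prefixes ("aa", "ab", ...) may split it better.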
The other option would be to use Azure Cognitive Search and create an index that uses Azure Blob Storage as its data source. It will be much faster; however, you would be using a different search service altogether to search your blobs.