I am managing data for a computer vision project and am looking for a fast way to search and manipulate all the files in a given directory (an S3 prefix, in this case). I have a working solution, but it only processes maybe 10-20 files per second. I am new to Jupyter Notebooks, so I am looking for recommendations on making the attached code more efficient.
Current code is as follows:
import boto3
import xml.etree.ElementTree as ET

# s3 (a boto3 S3 resource), paginator, source_keys, src_bucket, allowed_labels
# and the i/j file counters are set up in earlier notebook cells.
car_count = 0
label_dict = {}
purge_list = []
for each_src in source_keys:
    pages = paginator.paginate(Bucket=src_bucket, Prefix=each_src)
    for page in pages:
        for obj in page['Contents']:
            fpath = obj['Key']
            fname = fpath.split('/')[-1]
            if fname == '':
                continue
            copy_source = {
                'Bucket': src_bucket,
                'Key': fpath
            }
            if fname.endswith('.xml'):
                xml_obj = s3.Object(src_bucket, fpath)
                data = xml_obj.get()['Body'].read()
                root = ET.fromstring(data)
                for box in root.findall('object'):
                    name = box.find('name').text
                    if name in label_dict:
                        label_dict[name] += 1   # count every occurrence of the label
                    else:
                        label_dict[name] = 1
                    if name not in allowed_labels:
                        purge_list.append(fpath)
            print(f'Labels: {label_dict}', end='\r')

print(f'\nTotal image files: {i}, Total XML files: {j}', end='\r')
# print(f'\nLabels: {label_dict}')
print(f'\nPURGE LIST: ({len(purge_list)} files)')
Possible solutions:
- Multithreading - I have done threading in plain Python 3.x; is it common to multithread within a notebook? (See the sketch below for what I have in mind.)
- Read less of each file - currently I read in the whole file; I am not sure whether this is a big bottleneck, but reading less might increase speed.
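For the multithreading option, roughly what I have in mind is sketched below: the per-object get_object() calls are blocking network round trips, so running them from a thread pool should overlap the waiting. The bucket name, prefix, label set, worker count and the read_xml helper are placeholders for illustration, not my real values.

import concurrent.futures
import xml.etree.ElementTree as ET
from collections import Counter

import boto3

# Placeholder values -- substitute real bucket, prefixes and labels.
src_bucket = 'my-bucket'
source_keys = ['annotations/']
allowed_labels = {'car', 'truck'}

# A boto3 client (unlike a resource) is generally safe to share across threads.
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

def read_xml(key):
    """Fetch one XML annotation file and return (key, list of label names)."""
    body = s3_client.get_object(Bucket=src_bucket, Key=key)['Body'].read()
    root = ET.fromstring(body)
    return key, [box.find('name').text for box in root.findall('object')]

# Listing keys is cheap compared to the per-object GETs, so collect them up front.
xml_keys = []
for each_src in source_keys:
    for page in paginator.paginate(Bucket=src_bucket, Prefix=each_src):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.xml'):
                xml_keys.append(obj['Key'])

label_dict = Counter()
purge_list = []

# The GETs are I/O-bound, so threads overlap the network waits despite the GIL.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for key, names in pool.map(read_xml, xml_keys):
        label_dict.update(names)
        if any(name not in allowed_labels for name in names):
            purge_list.append(key)

print(f'Labels: {dict(label_dict)}')
print(f'PURGE LIST: ({len(purge_list)} files)')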
CodePudding user response:
Jupyter usually has a lot of overhead - also, your code nests several levels of for loops. In the Python world, the fewer for loops the better - also, binary data is almost always faster. So, a number of suggestions:
- restructure your for loops; use a specialized lib from PyPI for the filesystem work
- change language? use a bash script
- multithreading is a way indeed
- caching: use redis or another fast data store to "read in" the data (see the sketch below)
- golang is comparatively easy to jump to from Python and also has good multithreading support - my 2 cents: it's worth a try at least
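To make the caching point concrete, here is a rough sketch of fronting the S3 reads with redis so that a repeated scan only hits S3 for keys it has not cached yet. The local redis instance, the placeholder bucket name and the read_xml_cached helper are assumptions for illustration, not your actual setup.

import boto3
import redis

src_bucket = 'my-bucket'                           # placeholder bucket name
s3_client = boto3.client('s3')
cache = redis.Redis(host='localhost', port=6379)   # assumes a local redis instance

def read_xml_cached(key):
    """Return the raw XML bytes for `key`, hitting S3 only on a cache miss."""
    body = cache.get(key)
    if body is None:
        body = s3_client.get_object(Bucket=src_bucket, Key=key)['Body'].read()
        cache.set(key, body)                       # keep the raw bytes for the next pass
    return body

For a one-shot scan this buys little, though - parallelising the per-object GETs is where most of the time goes.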