How do I increase speed of XML retrieval and parsing in S3 Jupyter Notebook (SageMaker Studio)?


I am managing data for a computer vision project and am looking for a fast way to search and manipulate all the files under a given prefix. I have a working solution, but it only processes about 10-20 files per second. I am new to Jupyter Notebooks, so I am looking for recommendations on increasing the efficiency of the attached code.

Current code is as follows:

import boto3
import xml.etree.ElementTree as ET

s3 = boto3.resource('s3')
paginator = boto3.client('s3').get_paginator('list_objects_v2')

car_count = 0
label_dict = {}
purge_list = []
i = j = 0  # image / XML file counters for the summary print below
for each_src in source_keys:
    pages = paginator.paginate(Bucket=src_bucket, Prefix=each_src)
    for page in pages:
        for obj in page['Contents']:
            fpath = obj['Key']
            fname = fpath.split('/')[-1]
            if fname == '':
                continue
            copy_source = {
                'Bucket': src_bucket,
                'Key': fpath
            }
            if fname.endswith('.xml'):
                j += 1
                s3_obj = s3.Object(src_bucket, fpath)  # renamed so it no longer shadows the loop's obj
                data = s3_obj.get()['Body'].read()
                root = ET.fromstring(data)
                for box in root.findall('object'):
                    name = box.find('name').text
                    if name in label_dict:
                        label_dict[name] += 1
                    else:
                        label_dict[name] = 1
                    if name not in allowed_labels:
                        purge_list.append(fpath)
                print(f'Labels: {label_dict}', end='\r')
            else:
                i += 1  # every non-XML key is counted as an image
    print(f'\nTotal image files: {i}, Total XML files: {j}', end='\r')
#print(f'\nLabels: {label_dict}')
print(f'\nPURGE LIST: ({len(purge_list)} files)')

Possible solutions:

  • Multithreading - I have done threading in normal Python 3.x; is it common to multithread within a notebook?
  • Read less of the file - I currently read in the whole file. I am not sure whether this is a large bottleneck, but reading less may increase speed. Something like the streaming parse sketched below, perhaps?
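For that second idea, an untested sketch: ET.iterparse accepts any file-like object, and the boto3 StreamingBody is one, so the parse could consume the download directly instead of read()-ing everything first (same s3, src_bucket, and fpath as in the code above):

import xml.etree.ElementTree as ET

# Untested: parse the XML as it streams down instead of after a full read().
body = s3.Object(src_bucket, fpath).get()['Body']
for event, elem in ET.iterparse(body):
    if elem.tag == 'name':  # fires on each closing </name>, e.g. <object><name>
        print(elem.text)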

CodePudding user response:

Jupyter usually has a lot of overhead, and your code nests three levels of for loops. In the Python world, the fewer for loops the better; binary data is also almost always faster to work with than text. So, a number of suggestions:

  • restructure your for loops, and use a specialized library from PyPI for filesystem-style access to S3 (for example s3fs; see the first sketch after this list)
  • change language? use a bash script
  • multithreading is indeed an option (see the second sketch after this list)
  • caching: use Redis or another fast data store to "read in" the data once
  • Golang is comparatively easy to jump to from Python and also has good multithreading support - my 2 cents: it's worth a try at least
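A minimal sketch of the first suggestion, assuming s3fs as the PyPI library and the src_bucket / each_src names from the question:

import s3fs

# s3fs exposes S3 as a filesystem: glob() replaces the nested
# paginator loops, and cat() fetches one object's bytes.
fs = s3fs.S3FileSystem()
xml_paths = fs.glob(f'{src_bucket}/{each_src}/**/*.xml')
for path in xml_paths:
    data = fs.cat(path)
    # ...parse with ET.fromstring(data) as before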
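And a minimal sketch of the multithreading suggestion, which works the same inside a notebook as in plain Python. It assumes xml_keys (the .xml keys already collected from the paginator) plus src_bucket and allowed_labels from the question; note that a boto3 client is thread-safe while boto3 resources are not:

import concurrent.futures
import xml.etree.ElementTree as ET
from collections import Counter

import boto3

s3_client = boto3.client('s3')  # a single client can be shared across threads

def fetch_labels(key):
    """Download one XML annotation and return (key, its <object>/<name> labels)."""
    body = s3_client.get_object(Bucket=src_bucket, Key=key)['Body'].read()
    root = ET.fromstring(body)
    return key, [box.findtext('name') for box in root.findall('object')]

label_counts = Counter()
purge_list = []
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for key, names in pool.map(fetch_labels, xml_keys):
        label_counts.update(names)
        if any(n not in allowed_labels for n in names):
            purge_list.append(key)

Since most of the per-file time is S3 GET latency rather than parsing, overlapping the downloads is usually the biggest single win here.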