Python: Progress bar in parse function?

Time:04-15

I have previously managed to set up a progress bar with tqdm for a simple for-loop, but am now trying to do something slightly different:

I have an XML file containing several items, each of which I pass to a function that extracts specific information; I then convert the results to a dataframe. The function looks roughly like this:

def parse_record(xml):
      
    ns = {"marc":"http://www.loc.gov/MARC21/slim"}

    #ID:      
    id = xml.findall("marc:controlfield[@tag = '001']", namespaces=ns)
    try:
        id = id[0].text
    except IndexError:  # no controlfield 001 found in this record
        id = 'fail'
        
    #Creator: 
    creator = xml.findall("marc:datafield[@tag = '100']/marc:subfield[@code = 'a']", 
         namespaces=ns)

    if creator:
        creator = creator[0].text
    else:
        creator = "fail"

    gathered = {'ID':id, 'Creator':creator}
    
    return gathered

I then call this function on every item in the main XML file and convert the results to a dataframe:

import pandas as pd

result = [parse_record(item) for item in records]
df = pd.DataFrame(result)
df

This all works fine, but I am not sure how to add a progress bar, since the for-loop sits inside a list comprehension rather than on its own.

If I add the tqdm call inside the function, it only ever counts to 1, and it does this hundreds of times (once per item in the XML file). I haven't managed to attach it to the parsing loop itself.

Any help would be much appreciated!

Answer:

You pretty much just need to break up your list comprehension. I'll use Enlighten here but you can accomplish the same thing with tqdm.

import enlighten
import pandas as pd

records: list = ...

# total sets the length of the bar, so records must support len()
manager = enlighten.get_manager()
pbar = manager.counter(total=len(records), desc='Parsing records', unit='records')

result = []
for item in records:
    result.append(parse_record(item))
    pbar.update()  # advance the bar by one record

df = pd.DataFrame(result)
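
For comparison, here is a rough sketch of the same loop with tqdm, which wraps the iterable directly instead of using a separate counter. The sample XML, the condensed parse_record, and the ImportError fallback are assumptions added for illustration so the snippet runs on its own; they are not part of the original question.

```python
import xml.etree.ElementTree as ET

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no bar, same results
    def tqdm(iterable, **kwargs):
        return iterable

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical two-record sample standing in for the real XML file
SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:controlfield tag="001">a1</marc:controlfield>
    <marc:datafield tag="100">
      <marc:subfield code="a">Doe, Jane</marc:subfield>
    </marc:datafield>
  </marc:record>
  <marc:record>
    <marc:controlfield tag="001">a2</marc:controlfield>
  </marc:record>
</marc:collection>"""


def parse_record(xml):
    # Condensed version of the function from the question
    ids = xml.findall("marc:controlfield[@tag='001']", namespaces=NS)
    creators = xml.findall(
        "marc:datafield[@tag='100']/marc:subfield[@code='a']", namespaces=NS
    )
    return {
        "ID": ids[0].text if ids else "fail",
        "Creator": creators[0].text if creators else "fail",
    }


records = ET.fromstring(SAMPLE).findall("marc:record", namespaces=NS)

# Wrapping the iterable is enough; tqdm reads the total from len(records)
result = []
for item in tqdm(records, desc="Parsing records", unit="records"):
    result.append(parse_record(item))
```

The list comprehension from the question should also work unchanged if the iterable is wrapped the same way, i.e. [parse_record(item) for item in tqdm(records)], and result can be passed to pd.DataFrame as before.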

If records is a generator rather than a sequence such as a list, it has no length, so you'll need to convert it with list() or tuple() first so that len(records) works.
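
A small sketch of the generator case, using a made-up iter_records() generator in place of the real records (the name and the dummy data are hypothetical):

```python
def iter_records():
    # Hypothetical generator standing in for a lazily produced `records`
    for i in range(3):
        yield {"ID": str(i)}


records = iter_records()

# Generators do not support len(), so the total cannot be read directly
try:
    len(records)
    has_len = True
except TypeError:
    has_len = False

# Materialise the generator once; after this, len() works as usual
records = list(records)
total = len(records)
```

Note that list() pulls every item into memory at once, which is usually fine for a few hundred records but may matter for very large files.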
