I have previously managed to set up a progress bar with tqdm for a simple for-loop, but am now trying to do something slightly different:
I have an XML file with several items in it. I pass each item to a function that extracts specific information, and then convert the results to a dataframe. The function looks roughly like this:
def parse_record(xml):
    ns = {"marc": "http://www.loc.gov/MARC21/slim"}
    # ID:
    id = xml.findall("marc:controlfield[@tag = '001']", namespaces=ns)
    try:
        id = id[0].text
    except IndexError:
        # findall() returned an empty list, so there is no 001 field
        id = 'fail'
    # Creator:
    creator = xml.findall("marc:datafield[@tag = '100']/marc:subfield[@code = 'a']",
                          namespaces=ns)
    if creator:
        creator = creator[0].text
    else:
        creator = "fail"
    gathered = {'ID': id, 'Creator': creator}
    return gathered
I then call this function looping through all the single items in the main xml-file and convert it to a dataframe:
result = [parse_record(item) for item in records]
df = pd.DataFrame(result)
df
This all works fine, but I am not sure how to add a progress bar to it, since the for-loop isn't on its own.
If I add the tqdm bit inside the function, it obviously only ever counts to 1, and does this hundreds of times (depending on how many items the XML file includes). I haven't managed to add it to the parsing part.
Any help would be much appreciated!
CodePudding user response:
You pretty much just need to break up your list comprehension. I'll use Enlighten here but you can accomplish the same thing with tqdm.
import enlighten
import pandas as pd

records: list = ...  # the items from the XML file

manager = enlighten.get_manager()
pbar = manager.counter(total=len(records), desc='Parsing records', unit='records')

result = []
for item in records:
    result.append(parse_record(item))
    pbar.update()  # advance the bar by one per parsed record

df = pd.DataFrame(result)
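For reference, here's roughly what the tqdm version could look like. With tqdm you don't even have to break up the comprehension, because wrapping the iterable is enough. This is only a sketch: the sample XML and the trimmed-down parse_record below are made up for illustration, and the try/except around the import is just there so the snippet still runs if tqdm isn't installed.

```python
import xml.etree.ElementTree as ET

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs where tqdm is not installed.
    def tqdm(iterable, **kwargs):
        return iterable

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Tiny made-up sample standing in for the real XML file.
SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:controlfield tag="001">rec-1</marc:controlfield>
    <marc:datafield tag="100"><marc:subfield code="a">Doe, Jane</marc:subfield></marc:datafield>
  </marc:record>
  <marc:record>
    <marc:controlfield tag="001">rec-2</marc:controlfield>
  </marc:record>
</marc:collection>"""


def parse_record(xml):
    """Simplified stand-in for the parse_record function from the question."""
    ids = xml.findall("marc:controlfield[@tag='001']", namespaces=NS)
    creators = xml.findall(
        "marc:datafield[@tag='100']/marc:subfield[@code='a']", namespaces=NS
    )
    return {
        "ID": ids[0].text if ids else "fail",
        "Creator": creators[0].text if creators else "fail",
    }


records = ET.fromstring(SAMPLE).findall("marc:record", namespaces=NS)

# Wrapping the iterable is enough: the bar advances once per item consumed.
result = [
    parse_record(item)
    for item in tqdm(records, desc="Parsing records", unit="records")
]
```

The resulting list can be passed to pd.DataFrame(result) exactly as before.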
If records is a generator rather than a list (or another sequence with a known length), you'll need to wrap it with list() or tuple() first so you can get the length.
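To illustrate the length problem, here's a small sketch. The iter_records generator is hypothetical and just stands in for however your records are produced lazily.

```python
def iter_records():
    """Hypothetical generator standing in for a lazily produced stream of records."""
    for i in range(3):
        yield f"record-{i}"


records = iter_records()

try:
    len(records)
    has_len = True
except TypeError:
    # Generators do not implement __len__, so the total is unknown up front.
    has_len = False

# Materialize once so the total is known before the loop starts.
records = list(records)
total = len(records)
```

The trade-off is that list() holds every record in memory at once. If you already know the count from somewhere else, tqdm also takes a total= argument, so you can leave the generator as it is.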