Using Microsoft Books.xml https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)
Sample of first entry.
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
</catalog>
I'm trying to extract the titles of all books whose id is odd, capturing the id and title as key and value in a dictionary.
So far I have this working except all titles return as a single item list.
import xml.etree.ElementTree as ET

tree = ET.parse('books.xml')
root = tree.getroot()
data = {}
for child in root.findall('book'):
    for k, v in child.items():
        for title in child.iter('title'):
            if int(v.split('k')[1]) % 2 != 0:
                if k not in data:
                    data[v] = []
                data[v].append(title.text)
print(data)
Output
{'bk101': ["XML Developer's Guide"], 'bk103': ['Maeve Ascendant'], 'bk105': ['The Sundered Grail'], 'bk107': ['Splish Splash'], 'bk109': ['Paradox Lost'], 'bk111': ['MSXML3: A Comprehensive Guide']}
Desired Output
{'bk101': "XML Developer's Guide", 'bk103': 'Maeve Ascendant', 'bk105': 'The Sundered Grail', 'bk107': 'Splish Splash', 'bk109': 'Paradox Lost', 'bk111': 'MSXML3: A Comprehensive Guide'}
How can I return the title as text not as a list?
Note I can pull them from my dictionary as text with
print(data['bk101'][0])
However, I would prefer them saved into the dictionary as text rather than extracted later.
Edit: I realise it's because I am creating a list as the default value when checking whether the key exists. However, I cannot use None as a placeholder to avoid the list side effect.
I realise I should probably be using fromkeys, as in this SO answer on initialising a dict with keys and empty values.
But how do I do this in the loop?
CodePudding user response:
If you don't expect multiple <title> tags per book, there is no need to use a list: you can just assign the value of title.text rather than appending it. Additionally, iterating over child.items() is unnecessary when you know you specifically want the id attribute. It could also cause problems if there were other attributes, since their values would not be in the same format and the split would fail.
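To illustrate the point about other attributes, here is a minimal sketch. The lang attribute is hypothetical (it is not in the Microsoft sample file) and is only there to show what happens when the id parsing is applied to every attribute value.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: the "lang" attribute is an assumption for illustration only.
book = ET.fromstring(
    '<book id="bk101" lang="en"><title>XML Developer\'s Guide</title></book>'
)

# Looping over every attribute applies the "bkNNN" parsing to "en" as well.
failed = []
for k, v in book.items():
    try:
        int(v.split('k')[1])
    except (IndexError, ValueError):
        # "en".split('k') has no second element, so the parsing fails here
        failed.append(k)

print(failed)          # names of attributes whose value is not in the "bkNNN" format
print(book.get('id'))  # reading the id attribute directly avoids the problem
```

Using get('id') means only the attribute you care about is ever parsed.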
The simplified code, based on the assumption that every <book> has an id and a <title> child (like in your sample XML), is as follows:
for child in root.findall('book'):
    book_id = child.get('id')
    if int(book_id.split('k')[1]) % 2 != 0:
        data[book_id] = child.find('title').text
print(data)
This gives the output:
{'bk101': "XML Developer's Guide", 'bk103': 'Maeve Ascendant', 'bk105': 'The Sundered Grail', 'bk107': 'Splish Splash', 'bk109': 'Paradox Lost', 'bk111': 'MSXML3: A Comprehensive Guide'}
If it is possible for <title> to be missing, find() can return None, so an additional if condition would be needed.
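A minimal sketch of that extra check, using an inline XML string instead of books.xml so it runs on its own. The sample data is made up for illustration: bk103 deliberately has no title.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: bk102 has an even id, bk103 has no <title> child.
SAMPLE = """
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
  <book id="bk103"><author>Corets, Eva</author></book>
</catalog>
"""

root = ET.fromstring(SAMPLE)
data = {}
for child in root.findall('book'):
    book_id = child.get('id')
    if int(book_id.split('k')[1]) % 2 != 0:
        title = child.find('title')
        # find() returns None when there is no <title>, so skip those books
        if title is not None:
            data[book_id] = title.text

print(data)
```

If you would rather keep such books in the dictionary with a None value, child.findtext('title') returns the text directly and gives None when the tag is absent.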
If you expect multiple <title> tags per <book>, it would be better to have a list and use child.iter('title') like in your question. This would implicitly handle the missing title case too, since the code inside the loop won't run.
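A sketch of the list-based variant, again on made-up inline data (a book with two titles and a book with none). It uses dict.setdefault to create the list on first use, which is one way to do it rather than the only one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: bk101 has two <title> tags, bk103 has none.
SAMPLE = """
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title><title>Second Edition</title></book>
  <book id="bk103"><author>Corets, Eva</author></book>
  <book id="bk105"><title>The Sundered Grail</title></book>
</catalog>
"""

root = ET.fromstring(SAMPLE)
data = {}
for child in root.findall('book'):
    book_id = child.get('id')
    if int(book_id.split('k')[1]) % 2 != 0:
        # The loop body never runs for a book without <title>,
        # so such a book gets no entry in the dictionary.
        for title in child.iter('title'):
            data.setdefault(book_id, []).append(title.text)

print(data)
```

Here bk103 is simply absent from the result, because the key is only created once a title is found.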