return python lxml text as string not single item list - from xml

Time:03-28

Using Microsoft Books.xml https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)

Sample of first entry.

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

I am trying to extract the titles of all books whose id is odd, storing each id and title as a key/value pair in a dictionary.

So far I have this working except all titles return as a single item list.

import xml.etree.ElementTree as ET

tree = ET.parse('books.xml')
root = tree.getroot()

data = {}
for child in root.findall('book'):
    for k,v in child.items():
        for title in child.iter('title'):
            if int(v.split('k')[1]) % 2 != 0:
                if k not in data:
                    data[v] = []
                data[v].append(title.text)
            
print(data['bk101'])

Output

{'bk101': ["XML Developer's Guide"], 'bk103': ['Maeve Ascendant'], 'bk105': ['The Sundered Grail'], 'bk107': ['Splish Splash'], 'bk109': ['Paradox Lost'], 'bk111': ['MSXML3: A Comprehensive Guide']}

Desired Output

{'bk101': "XML Developer's Guide", 'bk103': 'Maeve Ascendant', 'bk105': 'The Sundered Grail', 'bk107': 'Splish Splash', 'bk109': 'Paradox Lost', 'bk111': 'MSXML3: A Comprehensive Guide'}

How can I return the title as text not as a list?

Note I can pull them from my dictionary as text with

print(data['bk101'][0])

However, I would prefer them stored in the dictionary as strings rather than extracted later.

Edit: I realise it's because I am creating a list as the default value when checking whether the key exists. However, I cannot use None as a placeholder to avoid the list side effect.

I realise I should probably be using fromkeys, as in this SO answer on initialising a dict with keys and empty values.

But how do I do this in the loop?

Answer:

If you don't expect multiple <title> tags per book, there is no need for a list: assign title.text directly instead of appending it. Iterating over child.items() is also unnecessary when you specifically want the id attribute. It would cause problems if a <book> had any other attribute, because that attribute's value would not be in the "bk" plus number format and the split/int conversion would fail.

The simplified code based on the assumption that every <book> has an id and a <title> child (like in your sample XML) is as follows:

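import xml.etree.ElementTree as ET

root = ET.parse('books.xml').getroot()
data = {}
for child in root.findall('book'):
    book_id = child.get('id')
    # 'bk101' -> '101' -> 101; keep only odd ids
    if int(book_id.split('k')[1]) % 2 != 0:
        data[book_id] = child.find('title').text

print(data)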

This gives the output:

{'bk101': "XML Developer's Guide", 'bk103': 'Maeve Ascendant', 'bk105': 'The Sundered Grail', 'bk107': 'Splish Splash', 'bk109': 'Paradox Lost', 'bk111': 'MSXML3: A Comprehensive Guide'}

If it is possible for <title> to be missing, find() can return None, so an additional if condition would be needed.
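
A minimal sketch of that extra check, assuming an inline sample in which bk103 has no <title>. The SAMPLE string and the odd_titles helper are made up for illustration and are not part of Books.xml or the question's code:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: bk103 deliberately has no <title> child.
SAMPLE = """<catalog>
   <book id="bk101"><title>XML Developer's Guide</title></book>
   <book id="bk102"><title>Midnight Rain</title></book>
   <book id="bk103"><author>Corets, Eva</author></book>
</catalog>"""

def odd_titles(root):
    """Map odd book ids to their title text, skipping books without a <title>."""
    data = {}
    for book in root.findall('book'):
        book_id = book.get('id')
        if int(book_id.split('k')[1]) % 2 == 0:
            continue  # even id, not wanted
        title = book.find('title')
        # Compare against None explicitly: an Element with no children
        # is falsy, so a plain "if title:" would skip valid titles.
        if title is not None:
            data[book_id] = title.text
    return data

result = odd_titles(ET.fromstring(SAMPLE))
print(result)
```

With this sample, bk102 is dropped for having an even id and bk103 is dropped for having no title, so only bk101 ends up in the dictionary.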

If you expect multiple <title> tags per <book>, it would be better to have a list and use child.iter('title') like in your question. This would implicitly handle the missing title case too, since the code inside the loop won't run.
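
For the multiple-title case, one possible sketch uses dict.setdefault so the list is created only the first time a key is seen. The SAMPLE_MULTI string (including the second title on bk101) and the odd_title_lists helper are assumptions for illustration only:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: bk101 has two titles, bk103 has none.
SAMPLE_MULTI = """<catalog>
   <book id="bk101"><title>XML Developer's Guide</title><title>Second Edition</title></book>
   <book id="bk103"><author>Corets, Eva</author></book>
   <book id="bk105"><title>The Sundered Grail</title></book>
</catalog>"""

def odd_title_lists(root):
    """Map odd book ids to a list of all their title texts."""
    data = {}
    for book in root.findall('book'):
        book_id = book.get('id')
        if int(book_id.split('k')[1]) % 2 != 0:
            for title in book.iter('title'):
                # setdefault creates the list once, then reuses it,
                # so earlier titles are not overwritten.
                data.setdefault(book_id, []).append(title.text)
    return data

multi = odd_title_lists(ET.fromstring(SAMPLE_MULTI))
print(multi)
```

Because the inner loop never runs for bk103, no key is created for it, which is the implicit handling of missing titles described above.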
