BeautifulSoup: dealing with duplicate node names in XML

Time:02-15

I'm trying to use BeautifulSoup to parse an XML response and list just the code values, but I've run into a problem when a parent node has the same name as one of its child nodes. For example, "code" is used both as a parent element name and as a child element name.

XML used:

<codes>
   <code>
     <id>9601</id>
     <description>Description 1</description>
     <code>C1</code>
   </code>
   <code>
     <id>9602</id>
     <description>Description 2</description>
     <code>C2</code>
   </code>
   <code>
     <id>9603</id>
     <description>Description 3</description>
     <code>C3</code>
   </code>
   <code>
     <id>9604</id>
     <description>Description 4</description>
     <code>C4</code>
   </code>
   <code>
     <id>9605</id>
     <description>Description 5</description>
     <code>C5</code>
   </code>
 </codes>

Simple Python code:

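from bs4 import BeautifulSoup

# Read the XML response from disk; the with block closes the file
with open("response.xml", "r") as infile:
    contents = infile.read()

soup = BeautifulSoup(contents, 'xml')
# find_all matches every <code> element, both parents and children
codes = soup.find_all('code')

for code in codes:
    print(code.get_text())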

The output is:

9601
Description 1
C1

C1

9602
Description 2
C2

C2

9603
Description 3
C3

C3

9604
Description 4
C4

C4

9605
Description 5
C5

C5

The output I want is:

C1
C2
C3
C4
C5

What's the best way to handle these situations?

CodePudding user response:

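You can try the code below. It checks whether each code element has non-whitespace text of its own, which is true only for the inner elements. Note that it uses only the Python standard library (xml.etree.ElementTree), not an external library.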

import xml.etree.ElementTree as ET

xml = '''<codes>
   <code>
     <id>9601</id>
     <description>Description 1</description>
     <code>C1</code>
   </code>
   <code>
     <id>9602</id>
     <description>Description 2</description>
     <code>C2</code>
   </code>
   <code>
     <id>9603</id>
     <description>Description 3</description>
     <code>C3</code>
   </code>
   <code>
     <id>9604</id>
     <description>Description 4</description>
     <code>C4</code>
   </code>
   <code>
     <id>9605</id>
     <description>Description 5</description>
     <code>C5</code>
   </code>
 </codes>'''
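root = ET.fromstring(xml)
# './/code' matches both the outer records and the inner code elements
for code in root.findall('.//code'):
  # Outer elements hold only whitespace (or None) before their first child
  txt = (code.text or '').strip()
  if txt:
    print(txt)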

Output:

C1
C2
C3
C4
C5

CodePudding user response:

If you want to keep using BeautifulSoup, you can find each "id" element and go through its parent to access the other elements of that record; passing recursive=False restricts the search to the parent's direct children. Otherwise, the simplest approach is to use Python's core XML API to iterate over the top-level code elements.

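from bs4 import BeautifulSoup

# contents holds the XML text read from response.xml, as in the question
soup = BeautifulSoup(contents, 'xml')
elts = soup.find_all('id')

for elt in elts:
    # recursive=False limits the search to direct children of the record
    print(elt.parent.find('code', recursive=False).text)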

Output:

C1
C2
C3
C4
C5
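
The core XML API alternative mentioned in this answer is not shown, so here is a minimal sketch of what it might look like. It assumes the XML from the question is available as a string (the variable name xml_text is hypothetical, and the sample is trimmed to two records). The idea is that findall('code') without a './/' prefix matches only direct children of <codes>, and findtext('code') then reads the nested <code> child of each record.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first two records from the question's XML
xml_text = """<codes>
  <code>
    <id>9601</id>
    <description>Description 1</description>
    <code>C1</code>
  </code>
  <code>
    <id>9602</id>
    <description>Description 2</description>
    <code>C2</code>
  </code>
</codes>"""

root = ET.fromstring(xml_text)

# 'code' (no './/') matches only the top-level records under <codes>;
# findtext('code') returns the text of each record's direct <code> child
values = [record.findtext('code') for record in root.findall('code')]

for value in values:
    print(value)
```

Because the search never descends more than one level at a time, the parent and child elements sharing a name do not interfere with each other.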