I am using XML, BeautifulSoup, and Python to parse data. In this specific XML document, there are multiple child elements holding different names and values, but they all use the same child tag names (Name and Value).
I have attached an image showing how the document is laid out. I am trying to get the values for Occupation, MIBCarrierCode, MIBTestIndicator, and so on.
[Image: schema of the parent, child, and children elements]
from bs4 import BeautifulSoup

with open("arc45.xml", "r") as file:
    data = file.read()

# The "xml" parser requires lxml to be installed
Bs_data = BeautifulSoup(data, "xml")
Occupation = Bs_data.find_all("Name")
print(Occupation)
Output:
[-Name-Occupation-/Name-, -Name-CarrierCode-/Name-, -Name-TestIndicator-/Name-, -Name-LineOfBusinessCode-/Name-]
This only gives me the "Name" tags, but I need to grab the matching value and assign it to an Occupation variable.
If I search for "Value" instead, I receive this output:
Occupation = Bs_data.find_all("Value")
print(Occupation)
Output:
[-Value-Unknown-/Value-, -Value-111-/Value-, -Value-0-/Value-, -Value-1-/Value-]
I need to grab the Value only when the Name is Occupation, CarrierCode, and so on.
This is an example of the layout of the XML file.
-AdditonalAttributes-
  -Attribute-
    -Name- Occupation -/Name-
    -Value- Unknown -/Value-
  -/Attribute-
  -Attribute-
    -Name- CarrierCode -/Name-
    -Value- 656 -/Value-
Note: every - symbol above stands in for < or >; I used it so the XML tags would show up instead of disappearing.
I'm just not quite sure how to parse this information.
CodePudding user response:
You could use select() or find_all() to get every <Attribute>, then check its <Name> against a whitelist (or whatever condition you need) while iterating the ResultSet:
for a in soup.select('Attribute'):
    if a.Name.get_text(strip=True) in ['Occupation', 'CarrierCode']:
        print(a.Value.get_text(strip=True))
Example
from bs4 import BeautifulSoup

xml = '''
<AdditonalAttributes>
    <Attribute>
        <Name> Occupation </Name>
        <Value> Unknown </Value>
    </Attribute>
    <Attribute>
        <Name> CarrierCode </Name>
        <Value> 656 </Value>
    </Attribute>
</AdditonalAttributes>
'''

soup = BeautifulSoup(xml, 'xml')

# Print the <Value> of every <Attribute> whose <Name> is in the whitelist
for a in soup.select('Attribute'):
    if a.Name.get_text(strip=True) in ['Occupation', 'CarrierCode']:
        print(a.Value.get_text(strip=True))
Output
Unknown
656
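If each value needs to end up in its own variable, as the question asks, one option is to collect the Name/Value pairs into a dict first and look them up by name. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without lxml installed, the sample XML and the variable names (attributes, occupation, carrier_code) are made up for the example, and it assumes every Name appears only once.

```python
import xml.etree.ElementTree as ET

# Sample data mirroring the layout in the question (values are illustrative).
xml = """
<AdditonalAttributes>
    <Attribute>
        <Name> Occupation </Name>
        <Value> Unknown </Value>
    </Attribute>
    <Attribute>
        <Name> CarrierCode </Name>
        <Value> 656 </Value>
    </Attribute>
</AdditonalAttributes>
"""

root = ET.fromstring(xml.strip())

# Map each <Name> to the <Value> inside the same <Attribute>.
# Assumption: names are unique; a repeated name would overwrite the earlier one.
attributes = {
    attr.findtext("Name", default="").strip(): attr.findtext("Value", default="").strip()
    for attr in root.iter("Attribute")
}

# .get() returns None when the attribute is missing from the document.
occupation = attributes.get("Occupation")
carrier_code = attributes.get("CarrierCode")

print(occupation)
print(carrier_code)
```

The same idea carries over to the BeautifulSoup loop: build the dict while iterating the Attribute elements, then read the entries you need.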