I wanted to split some multivalued attributes inside an XML file.
Here is the content of NewestReport.xml:
<GenericItem html='ID: AAA1<br/>Age: 12<br/>Name: Baryk <'>
Employee:
</GenericItem>
<GenericItem html='ID: AAA2<br/>Age: 16<br/>Name: Nils <'>
Employee:
</GenericItem>
<GenericItem html='ID: AAA3<br/>Age: 18<br/>Name: Sarah <'>
Employee:
</GenericItem>
And here is the content of my Python script:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('NewestReport.xml', 'r'), 'lxml-xml')
br = soup.find_all("GenericItem")
for i in br:
    for i in soup.find("GenericItem").get("html").split("<br/>"):
        print(i.split(":")[1].replace("<", "").strip())
With this code I get the same values over and over: it keeps printing the values for Baryk only and none for the rest. Is there anything I can fix so that it moves on to the next item?
CodePudding user response:
There are two main issues with your code:
- you're overriding the loop variable i of the first loop in the second loop
- you're calling find("GenericItem") every time instead of using the result that you previously saved in the br variable
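To make the second issue concrete, here is a minimal sketch in plain Python, without bs4. The strings stand in for the html attribute of each parsed GenericItem, and first_item is a hypothetical stand-in for soup.find(), which only ever returns the first match.

```python
# Plain strings stand in for the html attribute of each GenericItem.
items = [
    "ID: AAA1<br/>Age: 12<br/>Name: Baryk <",
    "ID: AAA2<br/>Age: 16<br/>Name: Nils <",
    "ID: AAA3<br/>Age: 18<br/>Name: Sarah <",
]


def first_item(seq):
    """Hypothetical stand-in for soup.find(): always returns the first match."""
    return seq[0]


# Buggy pattern: the inner loop ignores the current item of the outer loop.
buggy = []
for item in items:
    for line in first_item(items).split("<br/>"):
        buggy.append(line.split(":")[1].replace("<", "").strip())

# Fixed pattern: the inner loop reads from the current item.
fixed = []
for item in items:
    for line in item.split("<br/>"):
        fixed.append(line.split(":")[1].replace("<", "").strip())

print(buggy)  # the first item's values, repeated once per item
print(fixed)  # the values of every item, in order
```

The buggy list repeats the first item's values because nothing inside the inner loop depends on which item the outer loop is currently visiting.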
I think that with a couple of small fixes like this it should do what you're expecting:
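from bs4 import BeautifulSoup

# Parse the report once and keep every GenericItem element
soup = BeautifulSoup(open('NewestReport.xml', 'r'), 'lxml-xml')
items = soup.find_all("GenericItem")
for item in items:
    # Read the html attribute of the current item, not of the first match
    for line in item.get("html").split("<br/>"):
        print(line.split(":")[1].replace("<", "").strip())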
Although, if you can be clearer about what you're trying to achieve, we might be able to suggest a better way to approach this.
CodePudding user response:
Note: As mentioned by Alex Viscreanu, there are some things to handle in your code first.
I would recommend using the lxml parser and CSS selectors to select the elements. The following iteration will print all values from each GenericItem:
for i in soup.select("GenericItem"):
    for e in i.get("html").split("<br/>"):
        print(e.split(":")[1].replace("<", "").strip())
--->
AAA1
12
Baryk
AAA2
16
Nils
AAA3
18
Sarah
To get more structured data you can create a list of dicts:
data = []
for i in soup.select("GenericItem"):
    t={}
    for d in i.get("html").split("<br/>"):
        c=d.split(":")
        k=c[0]
        v=c[1].strip(' |<')
        t[k]=v
    data.append(t)
Example
from bs4 import BeautifulSoup
xml="""<GenericItem html='ID: AAA1<br/>Age: 12<br/>Name: Baryk <'>
Employee:
</GenericItem>
<GenericItem html='ID: AAA2<br/>Age: 16<br/>Name: Nils <'>
Employee:
</GenericItem>
<GenericItem html='ID: AAA3<br/>Age: 18<br/>Name: Sarah <'>
Employee:
</GenericItem>"""
soup = BeautifulSoup(xml, 'lxml')
data = []
for i in soup.select("GenericItem"):
    t={}
    for d in i.get("html").split("<br/>"):
        c=d.split(":")
        k=c[0]
        v=c[1].strip(' |<')
        t[k]=v
    data.append(t)
data
Output
[{'ID': 'AAA1', 'Age': '12', 'Name': 'Baryk'},
{'ID': 'AAA2', 'Age': '16', 'Name': 'Nils'},
{'ID': 'AAA3', 'Age': '18', 'Name': 'Sarah'}]
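If the values may themselves contain a colon, or a fragment may lack one, the split(":") step above can lose data or raise an IndexError. As a sketch only, the per-item parsing could be pulled into a small helper. The name parse_fields is hypothetical, and it works on the attribute string alone, so it would be called as parse_fields(i.get("html")) inside the loop.

```python
def parse_fields(html_attr):
    """Turn 'Key: value<br/>Key: value <' into a dict (hypothetical helper)."""
    fields = {}
    for part in html_attr.split("<br/>"):
        # partition splits on the first colon only, so values keep their colons
        key, sep, value = part.partition(":")
        if not sep:
            continue  # skip fragments that are not in "Key: value" form
        fields[key.strip()] = value.strip(" <")
    return fields


print(parse_fields("ID: AAA1<br/>Age: 12<br/>Name: Baryk <"))
print(parse_fields("Time: 12:30<br/>broken<br/>Name: Nils <"))
```

The second call uses made-up input to show the two edge cases: the time value keeps its colon, and the fragment without a colon is skipped instead of raising an error.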