Why do I get the same value when I am iterating through my file with BeautifulSoup?

I want to split some multi-valued attributes inside an XML file.

Here is the content of NewestReport.xml:

 <GenericItem html='ID: AAA1&lt;br/&gt;Age: 12&lt;br/&gt;Name: Baryk &lt;'>
   Employee:
</GenericItem>
<GenericItem html='ID: AAA2&lt;br/&gt;Age: 16&lt;br/&gt;Name: Nils &lt;'>
   Employee:
</GenericItem>
<GenericItem html='ID: AAA3&lt;br/&gt;Age: 18&lt;br/&gt;Name: Sarah &lt;'>
   Employee:
</GenericItem>

And here is the content of my Python script:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('NewestReport.xml', 'r'), 'lxml-xml')
br = soup.find_all("GenericItem")
for i in br:
    for i in soup.find("GenericItem").get("html").split("<br/>"):
        print(i.split(":")[1].replace("<", "").strip())

With this code I get the same values on every iteration: it keeps printing the values for Baryk and never those for the other items. What can I change so that it moves on to the next item?

CodePudding user response:

There are two main issues with your code:

You're overwriting the i variable of the outer loop in the inner loop.
You're calling soup.find("GenericItem") on every pass, which always returns the first match, instead of using the results you already saved in the br variable.

With both fixed, something like this should do what you're expecting:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('NewestReport.xml', 'r'), 'lxml-xml')
items = soup.find_all("GenericItem")

for item in items:
    for line in item.get("html").split("<br/>"):
        print(line.split(":")[1].replace("<", "").strip())

That said, if you clarify what you're trying to achieve, we might be able to suggest a better approach.

Docs link:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
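
To see the "first match on every pass" problem in isolation, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so that it runs without extra packages; its find() and findall() behave analogously to find() and find_all() for this purpose. The <Report> wrapper element is an assumption added only because ElementTree needs a single root.

```python
import xml.etree.ElementTree as ET

# Hypothetical <Report> root wrapped around two of the sample items,
# because ElementTree requires exactly one root element.
xml = """<Report>
<GenericItem html='ID: AAA1&lt;br/&gt;Age: 12&lt;br/&gt;Name: Baryk &lt;'>Employee:</GenericItem>
<GenericItem html='ID: AAA2&lt;br/&gt;Age: 16&lt;br/&gt;Name: Nils &lt;'>Employee:</GenericItem>
</Report>"""

root = ET.fromstring(xml)
items = root.findall("GenericItem")

# Buggy pattern: find() searches the whole tree again inside the loop,
# so it returns the first GenericItem on every pass.
buggy = [root.find("GenericItem").get("html").split("<br/>")[0] for _ in items]

# Fixed pattern: read the attribute from the current loop item.
fixed = [item.get("html").split("<br/>")[0] for item in items]

print(buggy)  # ['ID: AAA1', 'ID: AAA1']
print(fixed)  # ['ID: AAA1', 'ID: AAA2']
```

The loop itself runs once per item in both cases; only the object the attribute is read from differs.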

CodePudding user response:

Note: As mentioned by Alex Viscreanu, there are a few things to fix in your code first.

I would recommend using the lxml parser and CSS selectors to select the elements. The following loop prints all values from each GenericItem:

for i in soup.select("GenericItem"):
    for e in i.get("html").split("<br/>"):
        print(e.split(":")[1].replace("<", "").strip())

Output
AAA1
12
Baryk
AAA2
16
Nils
AAA3
18
Sarah
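
Splitting on "<br/>" works because the parser decodes the escaped &lt;br/&gt; in the attribute value into a literal <br/>. As a rough illustration, the sketch below uses html.unescape from the standard library as a stand-in for that decoding step (an assumption for demonstration; it is not what BeautifulSoup calls internally):

```python
import html

# The raw attribute text as it appears in the XML file.
raw = "ID: AAA1&lt;br/&gt;Age: 12&lt;br/&gt;Name: Baryk &lt;"

# Decoding the entities yields the string the loop actually splits.
decoded = html.unescape(raw)
print(decoded)                 # ID: AAA1<br/>Age: 12<br/>Name: Baryk <
print(decoded.split("<br/>"))  # ['ID: AAA1', 'Age: 12', 'Name: Baryk <']
```

The trailing "<" left on the last piece is why the loop also calls replace("<", "").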

To get more structured data you can create a list of dicts:

data = []

for i in soup.select("GenericItem"):
    t={}
    for d in i.get("html").split("<br/>"):
        c=d.split(":")
        k=c[0]
        v=c[1].strip(' |<')
        t[k]=v
    data.append(t)

Example

from bs4 import BeautifulSoup
xml="""<GenericItem html='ID: AAA1&lt;br/&gt;Age: 12&lt;br/&gt;Name: Baryk &lt;'>
   Employee:
</GenericItem>
<GenericItem html='ID: AAA2&lt;br/&gt;Age: 16&lt;br/&gt;Name: Nils &lt;'>
   Employee:
</GenericItem>
<GenericItem html='ID: AAA3&lt;br/&gt;Age: 18&lt;br/&gt;Name: Sarah &lt;'>
   Employee:
</GenericItem>"""

soup = BeautifulSoup(xml, 'lxml')

data = []

for i in soup.select("GenericItem"):
    t={}
    for d in i.get("html").split("<br/>"):
        c=d.split(":")
        k=c[0]
        v=c[1].strip(' |<')
        t[k]=v
    data.append(t)

data

Output

[{'ID': 'AAA1', 'Age': '12', 'Name': 'Baryk'},
 {'ID': 'AAA2', 'Age': '16', 'Name': 'Nils'},
 {'ID': 'AAA3', 'Age': '18', 'Name': 'Sarah'}]
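
One caveat with split(":")[1]: it would drop part of any value that itself contains a colon. A possible variation, sketched below, uses str.partition to split on the first colon only. The parse_fields helper and the "Time" field are hypothetical names for illustration; they are not part of the original data or of BeautifulSoup.

```python
def parse_fields(html_value):
    """Turn 'ID: AAA1<br/>Age: 12<br/>Name: Baryk <' into a dict."""
    record = {}
    for part in html_value.split("<br/>"):
        # partition() splits on the first colon only, so a value such as
        # '12:30' keeps its full right-hand side.
        key, sep, value = part.partition(":")
        if not sep:
            continue  # skip fragments that are not 'key: value' pairs
        # Remove surrounding spaces and the stray trailing '<'.
        record[key.strip()] = value.strip(" <")
    return record


print(parse_fields("ID: AAA1<br/>Age: 12<br/>Name: Baryk <"))
# {'ID': 'AAA1', 'Age': '12', 'Name': 'Baryk'}

# Hypothetical input with a colon inside the value.
print(parse_fields("ID: AAA9<br/>Time: 12:30 <"))
# {'ID': 'AAA9', 'Time': '12:30'}
```

It could be dropped into the loop above as data.append(parse_fields(i.get("html"))).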