Python - Problem understanding the BeautifulSoup XML structure and (re-)writing to file

Time:08-20

I am trying to use Python to convert a (filtered) XML file into a properties file (comment, key, value).

My xml file looks like this:

<?xml version="1.0" encoding="utf-8"?><martif type="TBX" xml:lang="en">
    <text>
      <body>
        <termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175B1">
            <descrip type="definition">A short note to explain the term</descrip>
            <admin type="conceptOrigin">de</admin>
            <descrip type="characteristic">standardEntry</descrip>
            <admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
            <descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
            <descrip type="sapNonTranslatable"/>
            <descrip type="sapLegalRestriction"/>
            <descrip type="sapProprietaryRestriction"/>
            <descrip type="saptermCategory"/>
            <descrip type="entryNote"/>
            <admin type="productSubset">AC</admin>
            <descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
            <langSet xml:lang="ES">
                <ntig>
                  <termGrp>
                    <term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">compte</term>
                    <termNote type="partOfSpeech">Noun</termNote>
                  </termGrp>
                  <admin type="annotatedNote"/>
                </ntig>
            </langSet>
          </termEntry>
        <termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175BF">
            <descrip type="definition"/>
            <admin type="conceptOrigin">de</admin>
            <descrip type="characteristic">standardEntry</descrip>
            <admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
            <descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
            <descrip type="sapNonTranslatable"/>
            <descrip type="sapLegalRestriction"/>
            <descrip type="sapProprietaryRestriction"/>
            <descrip type="saptermCategory"/>
            <descrip type="entryNote"/>
            <admin type="productSubset">EHS</admin>
            <descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
            <langSet xml:lang="ES">
                <ntig>
                  <termGrp>
                    <term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">daño</term>
                    <termNote type="partOfSpeech">Noun</termNote>
                  </termGrp>
                  <admin type="annotatedNote"/>
                </ntig>
            </langSet>
          </termEntry>            
        </body>
    </text>
</martif>

I wrote the following code:

from bs4 import BeautifulSoup
import io
with open('SAPterm_TEST_ES.tbx', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')

with io.open('ES.properties', 'w') as f:

    Term = soup.find_all('termEntry')
    for termEntry in Term:
        print(termEntry('admin', {'type': 'productSubset'}))
        #f.write(termEntry('admin', {'type': 'productSubset'}).text)
        #f.write(' - ')        
        #f.write(termEntry('descrip', {'type': 'definition'}).text)
        #f.write('\n')
        f.write(termEntry['id'])
        f.write(' = ')
        f.write(termEntry.term.text)
        f.write('\n')

The result is this:

A) output in Console:

TBX_2_properties.py
[<admin type="productSubset">AC</admin>]
[<admin type="productSubset">EHS</admin>]

B) resulting file:

tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño

The issue I am having: I can print the tag that I also want to include in the properties file as a comment to the console, but writing it to the file always fails.

Error:

ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?


 File "C:\temp\Rewrite_TBX_2_properties.py", line 13, in <module>
    f.write(termEntry('admin', {'type': 'productSubset'}).text)

(This happens when I uncomment, for example, line 13: "f.write(termEntry('admin', {'type': 'productSubset'}).text)".)

I don't understand this and need your help (I am new to Python! ;-) ).

Furthermore, when I try this on a larger XML file (the one above is just a small version I used to test the basics), I get key/value pairs in which all the keys are correctly extracted from the termEntry id attribute, but the value is always the same: the one from the first entry.

Does anybody have any advice?

Thanks so much!

BTW, the result from the larger XML file looks like this:

tid_db6_015DAA6C5610D311AE6500A0C9EAAA94 = megbontási maradvány
tid_db6_01A0763DDC77D3118F330060B03CA38C = megbontási maradvány
tid_db6_01ADF9FDE40072439FF56E731E7EA2F6 = megbontási maradvány
tid_db6_01BCEA3AD3E2D3119B4F0060B0671ACC = megbontási maradvány
tid_db6_02BF9A6D9898D511AE780800062AFB0F = megbontási maradvány
tid_db6_0381F77095126448887E05013CBB4682 = megbontási maradvány
tid_db6_03CFFC4C6B64D311B60F0060B03C2BFF = megbontási maradvány
tid_db6_043968D0FAB9484DA122122A31C7A95C = megbontási maradvány

Answer:

termEntry('admin', {'type': 'productSubset'}) is shorthand for termEntry.find_all(...), so it returns a ResultSet (a list of tags). A ResultSet has no .text attribute, hence the error. Either iterate over the result set and use .text on each tag, or use find() / select_one() to get a single tag.
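To make the difference concrete, here is a small sketch. It assumes lxml is installed (the 'xml' parser needs it) and uses a shortened, made-up termEntry rather than the full file from the question:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one entry of the question's file (illustrative only).
xml = """<termEntry id="t1">
  <admin type="conceptOrigin">de</admin>
  <admin type="productSubset">AC</admin>
</termEntry>"""

soup = BeautifulSoup(xml, "xml")  # the "xml" parser requires lxml
entry = soup.find("termEntry")

# Calling a tag is shorthand for find_all(): the result is a ResultSet (a list).
matches = entry("admin", {"type": "productSubset"})
texts = [tag.text for tag in matches]  # .text works on each Tag, not on the list

# find() returns the first matching Tag (or None), which does have .text.
first = entry.find("admin", {"type": "productSubset"})
first_text = first.text if first is not None else ""

print(texts)       # ['AC']
print(first_text)  # AC
```

Calling matches.text here would raise the same AttributeError as in the question.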

If soup contains the XML document from the question, you can do:

with open("out.txt", "w") as f_out:
    for term_entry in soup.select("termEntry"):
        admin = term_entry.select_one('admin[type="productSubset"]')
        desc = term_entry.select_one('descrip[type="definition"]')
        print("{} - {}".format(admin.text, desc.text), file=f_out)

        for term in term_entry.select("term"):
            print("{} = {}".format(term_entry["id"], term.text), file=f_out)

This creates out.txt with content:

AC - A short note to explain the term
tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
EHS - 
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño
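Since the question asks for the product subset and definition to appear as a comment, a variation on the code above might look like the sketch below. It rests on a few assumptions: comment lines in the properties file start with #, a missing element should produce an empty string (select_one() returns None when nothing matches, and None.text would fail), and the output should be UTF-8. The helpers text_of and to_properties are hypothetical names, and the inline sample is a shortened stand-in for the real file:

```python
from bs4 import BeautifulSoup

def text_of(tag):
    """Return the stripped text of a tag, or '' if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ""

def to_properties(soup):
    """Build one '# comment' line per termEntry, followed by key = value lines."""
    lines = []
    for term_entry in soup.select("termEntry"):
        subset = text_of(term_entry.select_one('admin[type="productSubset"]'))
        definition = text_of(term_entry.select_one('descrip[type="definition"]'))
        lines.append("# {} - {}".format(subset, definition))
        for term in term_entry.select("term"):
            lines.append("{} = {}".format(term_entry["id"], term.text))
    return lines

# Shortened sample; the second entry has no productSubset element on purpose.
sample = """<body>
  <termEntry id="tid_1">
    <descrip type="definition">A short note</descrip>
    <admin type="productSubset">AC</admin>
    <langSet xml:lang="ES"><ntig><termGrp><term id="o1">compte</term></termGrp></ntig></langSet>
  </termEntry>
  <termEntry id="tid_2">
    <descrip type="definition"/>
    <langSet xml:lang="ES"><ntig><termGrp><term id="o2">daño</term></termGrp></ntig></langSet>
  </termEntry>
</body>"""

lines = to_properties(BeautifulSoup(sample, "xml"))  # "xml" parser requires lxml

# Write with an explicit encoding so non-ASCII terms such as "daño" survive.
with open("ES.properties", "w", encoding="utf-8") as f_out:
    f_out.write("\n".join(lines) + "\n")
```

Building the lines first and writing them in one step keeps the parsing logic separate from the file handling, which makes it easier to inspect the result before anything is written.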