How to check XML tags without text content in Python?

I'm using Python 3.10 with lxml to validate XML files that have been generated by a VBA macro. Before validating, I have to check each file for parts of the tree that contain no text content (other than whitespace) and remove them.

Example:

<n4ds:S10_G00_00>
        <n4ds:S10_G00_00_001>DGFIP-CAPU</n4ds:S10_G00_00_001>
        <n4ds:S10_G00_00_002>CUNOMF001_Janv2021</n4ds:S10_G00_00_002>
        <n4ds:S10_G00_00_003>v2022</n4ds:S10_G00_00_003>
        <n4ds:S10_G00_00_005>02</n4ds:S10_G00_00_005>
        <n4ds:S10_G00_00_006>P21V01</n4ds:S10_G00_00_006>
        <n4ds:S10_G00_00_008>01</n4ds:S10_G00_00_008>
        <n4ds:S10_G00_01>
            <n4ds:S10_G00_01_001>501975304</n4ds:S10_G00_01_001>
            <n4ds:S10_G00_01_002>26012</n4ds:S10_G00_01_002>
            <n4ds:S10_G00_01_003>NOMINATIF142021</n4ds:S10_G00_01_003>
            <n4ds:S10_G00_01_004>Avenue des Champs-Elysees</n4ds:S10_G00_01_004>
            <n4ds:S10_G00_01_005>93333</n4ds:S10_G00_01_005>
            <n4ds:S10_G00_01_006>BOURBOURG</n4ds:S10_G00_01_006>
            <n4ds:S10_G00_01_008>Z</n4ds:S10_G00_01_008>
            <n4ds:S10_G00_01_009>APT 25B</n4ds:S10_G00_01_009>
        </n4ds:S10_G00_01>
        <n4ds:S10_G00_02>
            <n4ds:S10_G00_02_001>01</n4ds:S10_G00_02_001>
            <n4ds:S10_G00_02_002>Pierre TOPAZE</n4ds:S10_G00_02_002>
            <n4ds:S10_G00_02_004>[email protected]</n4ds:S10_G00_02_004>
            <n4ds:S10_G00_02_005>0744215264</n4ds:S10_G00_02_005>
        </n4ds:S10_G00_02>
        <n4ds:S10_G00_95>
            <n4ds:S10_G00_95_001>LIART</n4ds:S10_G00_95_001>
            <n4ds:S10_G00_95_002>HAM-LES-MOINES</n4ds:S10_G00_95_002>
            <n4ds:S10_G00_95_003>50197530426012</n4ds:S10_G00_95_003>
            <n4ds:S10_G00_95_006>MtoM</n4ds:S10_G00_95_006>
            <n4ds:S10_G00_95_008>20210101091230</n4ds:S10_G00_95_008>
            <n4ds:S10_G00_95_900>2101NEORAUB3Message14CollectePH004</n4ds:S10_G00_95_900>
            <n4ds:S10_G00_95_901>[email protected]</n4ds:S10_G00_95_901>
        </n4ds:S10_G00_95>
        <n4ds:S20_G00_05 xsi:type="n4ds:Message_mensuel_des_revenus_autres">
            <n4ds:S20_G00_05_001>14</n4ds:S20_G00_05_001>
            <n4ds:S20_G00_05_002>01</n4ds:S20_G00_05_002>
            <n4ds:S20_G00_05_003>12</n4ds:S20_G00_05_003>
            <n4ds:S20_G00_05_004>250319523010</n4ds:S20_G00_05_004>
            <n4ds:S20_G00_05_005>2021-01-01</n4ds:S20_G00_05_005>
            <n4ds:S20_G00_05_007>2020-12-01</n4ds:S20_G00_05_007>
            <n4ds:S20_G00_05_009>IdMed001</n4ds:S20_G00_05_009>
            <n4ds:S20_G00_05_010>01</n4ds:S20_G00_05_010>
            <n4ds:S20_G00_07>
                <n4ds:S20_G00_07_001>VINCENT Tim</n4ds:S20_G00_07_001>
                <n4ds:S20_G00_07_002>0102030405</n4ds:S20_G00_07_002>
                <n4ds:S20_G00_07_003>[email protected]</n4ds:S20_G00_07_003>
                <n4ds:S20_G00_07_004>10</n4ds:S20_G00_07_004>
            </n4ds:S20_G00_07>
            <n4ds:S20_G00_96>
                <n4ds:S20_G00_96_902>4</n4ds:S20_G00_96_902>
            </n4ds:S20_G00_96>
            <n4ds:S21_G00_06>
                <n4ds:S21_G00_06_001>508203890</n4ds:S21_G00_06_001>
                <n4ds:S21_G00_06_002>26012</n4ds:S21_G00_06_002>
                <n4ds:S21_G00_06_003>5510Z</n4ds:S21_G00_06_003>
                <n4ds:S21_G00_06_004>PLACE VENDOME</n4ds:S21_G00_06_004>
                <n4ds:S21_G00_06_005>92600</n4ds:S21_G00_06_005>
                <n4ds:S21_G00_06_006>ASNIERE</n4ds:S21_G00_06_006>
                <n4ds:S21_G00_06_903>CONSEIL PASRAU</n4ds:S21_G00_06_903>
                <n4ds:S21_G00_11>
                    <n4ds:S21_G00_11_001>31284</n4ds:S21_G00_11_001>
                    <n4ds:S21_G00_11_002>8423Z</n4ds:S21_G00_11_002>
                    <n4ds:S21_G00_11_003>RUE DU PARADIS</n4ds:S21_G00_11_003>
                    <n4ds:S21_G00_11_004>75010</n4ds:S21_G00_11_004>
                    <n4ds:S21_G00_11_005>ALBERVILLE</n4ds:S21_G00_11_005>
                    <n4ds:S21_G00_11_006>CEDEX 99</n4ds:S21_G00_11_006>
                    <n4ds:S21_G00_11_111>20210210</n4ds:S21_G00_11_111>
                    <n4ds:S21_G00_11_904>SRENOMINATIF</n4ds:S21_G00_11_904>
                    <n4ds:S21_G00_11_905>0</n4ds:S21_G00_11_905>


                    <n4ds:S21_G00_30>
                        <n4ds:S21_G00_31></n4ds:S21_G00_31>
                        <n4ds:S21_G00_47>
                            <n4ds:S21_G00_48></n4ds:S21_G00_48>
                        </n4ds:S21_G00_47>
                        <n4ds:S21_G00_50>
                            <n4ds:S21_G00_51></n4ds:S21_G00_51>
                            <n4ds:S21_G00_56></n4ds:S21_G00_56>
                        </n4ds:S21_G00_50>
                        <n4ds:S21_G00_97></n4ds:S21_G00_97>
                    </n4ds:S21_G00_30>



                </n4ds:S21_G00_11>
            </n4ds:S21_G00_06>
        </n4ds:S20_G00_05>
    </n4ds:S10_G00_00>

In this case, to validate my file, I need to remove everything between <n4ds:S21_G00_30> and </n4ds:S21_G00_30> (and the tags themselves).

I've tried this code:

import re

# Matches an n4ds element that contains only whitespace, on a single line.
pattern = r"<n4ds:(.+)>(\s)*<\/n4ds:(.+)>"
repl = ''

def remove_empty_tags(file, pattern, repl):
    clean_lines = []
    with open(file, 'r') as fh:
        for line in fh:
            clean_lines.append(re.sub(pattern, repl, line))
    # Now save the file:
    with open(file, 'w') as fh:
        for line in clean_lines:
            fh.write(line)

But I'm having trouble finding the right regular expression (using regex with XML/HTML seems to be a bad idea). As it is right now, it doesn't deal with nested tags.

I saw that I could parse my file using ElementTree, but I couldn't find a way to iterate over it and check for empty subtrees.

If anybody knows how I can solve this problem, I would be very grateful for some help.

Best regards.

CodePudding user response：

"Using regex with XML/HTML seems to be a bad idea"

It is a horrible idea.

"As it is right now, it doesn't deal with nested tags."

...and that's one of the reasons why.

You said you have lxml. Use it.

Elements that have no text except whitespace (i.e. "empty after whitespace normalization") can be found in XPath with the condition normalize-space() = '', and those that have no child elements with not(*).
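To see what each half of that condition contributes, here is a small sketch run against a made-up document. The element names (root, a, b, ...) are hypothetical and only serve the illustration; the XPath expressions are the ones described above.

```python
from lxml import etree

# Hypothetical miniature document; the element names are invented for this demo.
xml = b"""<root>
  <a>text</a>
  <b>   </b>
  <c>
    <d></d>
  </c>
  <e><f>kept</f></e>
</root>"""

root = etree.fromstring(xml)

# Leaf elements whose string value is empty after whitespace normalization.
empty_leaves = root.xpath("//*[normalize-space() = '' and not(*)]")
empty_leaf_tags = [el.tag for el in empty_leaves]

# Without not(*): every element with no non-whitespace text anywhere below it,
# including containers such as <c> that still hold (empty) children.
empty_subtrees = root.xpath("//*[normalize-space() = '']")
empty_subtree_tags = [el.tag for el in empty_subtrees]

print(empty_leaf_tags)     # ['b', 'd']
print(empty_subtree_tags)  # ['b', 'c', 'd']
```

Note that <c> only shows up in the first result once <d> has been removed, because not(*) excludes any element that still has children.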

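It's easy to remove them from their respective parent elements in a loop. The loop has to repeat, because removing empty leaf elements can leave their parents empty in turn.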

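from lxml import etree as ET

tree = ET.parse(r'C:\path\to\your\input.xml')

# Repeat until a pass finds no more empty leaf elements.
while True:
    # Leaf elements with no text other than whitespace.
    empty_nodes = tree.xpath("//*[normalize-space() = '' and not(*)]")
    if not empty_nodes:
        break
    for node in empty_nodes:
        # lxml has no direct "delete"; removal goes through the parent.
        node.getparent().remove(node)

tree.write(r'C:\path\to\your\output.xml', pretty_print=True)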
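Since the question also mentions ElementTree, here is a minimal sketch of the same pruning with the standard library's xml.etree.ElementTree, which has neither normalize-space() nor getparent(). This is a swapped-in alternative, not the lxml approach above: it walks the tree bottom-up in a single recursive pass. The function name prune_empty, the sample document, and its namespace URI are all hypothetical, and the sketch assumes data-style XML without mixed content (no meaningful text following a child element).

```python
import xml.etree.ElementTree as ET


def prune_empty(element):
    """Remove descendants without non-whitespace text.

    Returns True if `element` itself ends up empty (hypothetical helper).
    """
    # Iterate over a copy, because children are removed during the loop.
    for child in list(element):
        if prune_empty(child):
            element.remove(child)
    has_text = bool((element.text or "").strip())
    # Empty means: no text of its own and no surviving children.
    return not has_text and len(element) == 0


# Hypothetical sample; the namespace URI is a placeholder, not the real N4DS one.
SAMPLE = """<n:root xmlns:n="urn:example:placeholder">
  <n:a>kept</n:a>
  <n:group>
    <n:b></n:b>
    <n:c>
      <n:d>  </n:d>
    </n:c>
  </n:group>
</n:root>"""

root = ET.fromstring(SAMPLE)
prune_empty(root)

# Strip the "{namespace}" part of each tag for readability.
remaining = [el.tag.split("}")[1] for el in root.iter()]
print(remaining)  # ['root', 'a']
```

When writing the result back to a file, ElementTree replaces unknown namespace prefixes with generated ones such as ns0 unless the prefix has been registered beforehand with ET.register_namespace().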

That being said, since you're using MSXML in that VBA macro (right?), and MSXML supports XPath, you can do the exact same thing right then and there, without ever saving the XML file in a state that needs post-processing in Python.