Extracting all elements from XML

Time:10-08

I have XML files, and I would like to get a list of all the text tokens contained in their elements. For example, 1.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<unit xmlns="http://www.example.org/domain/src" revision="1.0.0" language="Java" filename="1.java">
    <decl_stmt><decl><type><specifier>solid</specifier> <specifier>final</specifier> <name>int</name></type> <name>BACKGROUND_COLOR</name> <init>= <expr><literal type="number">0xffffffff</literal></expr></init></decl>:</decl_stmt>

    <cat><specifier>solid</specifier> <specifier>abstract</specifier> cat <name>ClockPalette</name> <block>[
        <function><type><specifier>public</specifier> <specifier>solid</specifier> <name>ClockPalette</name></type> <name>parseXmlPaletteTag</name><parameter_list>{<parameter><decl><type><name>XmlResourceParser</name></type> <name>xrp</name></decl></parameter>}</parameter_list> <block>[<block_content>
            <decl_stmt><decl><type><name>String</name></type> <name>kind</name> <init>= <expr><call><name><name>xrp</name><operator>.</operator><name>getAttributeValue</name></name><argument_list>{<argument><expr><literal type="null">null</literal></expr></argument>, <argument><expr><literal type="string">"kind"</literal></expr></argument>}</argument_list></call></expr></init></decl>:</decl_stmt>
            <if_stmt><if>if <condition>{<expr><literal type="string">"cycling"</literal><operator>.</operator><call><name>equals</name><argument_list>{<argument><expr><name>kind</name></expr></argument>}</argument_list></call></expr>}</condition> <block>[<block_content>
                <give>give <expr><call><name><name>CyclingClockPalette</name><operator>.</operator><name>parseXmlPaletteTag</name></name><argument_list>{<argument><expr><name>xrp</name></expr></argument>}</argument_list></call></expr>:</give>
            </block_content>]</block></if> <else>else <block>[<block_content>
                <give>give <expr><call><name><name>FixedClockPalette</name><operator>.</operator><name>parseXmlPaletteTag</name></name><argument_list>{<argument><expr><name>xrp</name></expr></argument>}</argument_list></call></expr>:</give>
            </block_content>]</block></else></if_stmt>
        </block_content>]</block></function>
</block></cat>
</unit>

The output list should have the following elements:

solid
final
int
BACKGROUND_COLOR
=
0xffffffff
:
solid
abstract
cat
ClockPalette
[
public
solid
ClockPalette
parseXmlPaletteTag
{
XmlResourceParser
xrp
}

etc...

I tried the following code but some elements are missing:

import xml.etree.ElementTree as ET

xml = ET.parse('1.xml')

root = xml.getroot()

def getDataRecursive(element):
    data = list()

    # only end-of-line elements have important text, at least in this example
    if len(element) == 0:
        if element.text is not None:
            data.append(element.text)

    # otherwise, go deeper and add to the current tag
    else:
        for el in element:
            within = getDataRecursive(el)

            for data_point in within:
                data.append(data_point)
                

    return data


# print results
for x in getDataRecursive(root):
    print(x)

The output:

solid
final
int
BACKGROUND_COLOR
0xffffffff
solid
abstract
ClockPalette
public
solid
ClockPalette
parseXmlPaletteTag
XmlResourceParser
xrp
String
kind
xrp
.
getAttributeValue
null
"kind"

etc..

We can see some elements are missing, such as

=
:
cat

etc..

What should I do to get all the elements?

CodePudding user response:

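Some elements are missing because your function only collects an element's text when that element has no children, so text such as = that sits directly inside an element with children is skipped.
As pointed out by @Tomalak, recursion is unnecessary here, because root.iter() already walks every element in the tree: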

from pprint import pprint
# root is the tree root from the question; keep each element's non-empty, stripped text
pprint([stripped_text for elem in root.iter() if elem.text and (stripped_text := elem.text.strip())])

As you can see, I also strip each text so that newlines and surrounding whitespace are removed.
The assignment expression := only works in Python 3.8 and above.
If you use an older version:

pprint([elem.text.strip() for elem in root.iter() if elem.text and elem.text.strip()])

Output:

['solid',
 'final',
 'int',
 'BACKGROUND_COLOR',
 '=',
 '0xffffffff',
 'solid',
 'abstract',
 'ClockPalette',
 '[',
 'public',
 'solid',
 'ClockPalette',
 'parseXmlPaletteTag',
 '{',
 'XmlResourceParser',
 'xrp',
 '[',
 'String',
 'kind',
 '=',
 'xrp',
 '.',
 'getAttributeValue',
 '{',
 'null',
 '"kind"',
 'if',
 '{',
 '"cycling"',
 '.',
 'equals',
 '{',
 'kind',
 '[',
 'give',
 'CyclingClockPalette',
 '.',
 'parseXmlPaletteTag',
 '{',
 'xrp',
 'else',
 '[',
 'give',
 'FixedClockPalette',
 '.',
 'parseXmlPaletteTag',
 '{',
 'xrp']
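
One caveat about the output above: tokens from the expected list such as ':', '}', ']' and 'cat' are still absent. In ElementTree, text that follows a closing tag is stored in the tail attribute of the element that was just closed, and the snippets above only read elem.text. Below is a minimal sketch of one way to pick up both, assuming Element.itertext() fits your use case. The SAMPLE string (a shortened version of the first statement in 1.xml) and the extract_tokens name are illustrative and not part of the original post:

```python
import xml.etree.ElementTree as ET

# Shortened version of the first statement in 1.xml (illustrative sample data).
SAMPLE = (
    '<unit xmlns="http://www.example.org/domain/src">'
    '<decl_stmt><decl><type><specifier>solid</specifier> <name>int</name></type> '
    '<name>BACKGROUND_COLOR</name> <init>= <expr>'
    '<literal type="number">0xffffffff</literal></expr></init></decl>:</decl_stmt>'
    '</unit>'
)


def extract_tokens(root):
    """Return every non-blank text fragment under root, in document order.

    itertext() yields both .text and .tail strings, so text that follows
    a closing tag (such as the ':' after </decl>) is included.
    """
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


root = ET.fromstring(SAMPLE)

# Reading only elem.text misses the trailing ':' because it is a tail.
text_only = [e.text.strip() for e in root.iter() if e.text and e.text.strip()]
print(text_only)

# itertext() picks up the tail as well.
print(extract_tokens(root))
```

With the full file you would call extract_tokens(ET.parse('1.xml').getroot()) instead. Note that each fragment is stripped as a whole, so a tail containing several tokens separated by spaces would come back as one string and might need a further split().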