Problem to parse some elements with xml.etree.ElementTree

Time:09-16

I hope you are well. I am having some difficulties with my parser. My dataset looks like this:

<?xml version="1.0"?>

<bugrepository name="AspectJ">
  <bug id="28974" opendate="2003-1-3 10:28:00" fixdate="2003-1-14 14:30:00">
    <buginformation>
      <summary>"Compiler error when introducing a ""final"" field"</summary>
      <description>The aspecs the problem...</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/AjcMemberMaker.java</file>
    </fixedFiles>
  </bug>

  <bug id="28919" opendate="2002-12-30 16:40:00" fixdate="2003-1-14 15:06:00">
    <buginformation>
      <summary>waever tries to weave into native methods ...</summary>
      <description>If youat org.aspectj.ajdt.internal.core.burce</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29186" opendate="2003-1-8 21:22:00" fixdate="2003-1-14 16:43:00">
    <buginformation>
      <summary>ajc -emacssym chokes on pointcut that includes an intertype method</summary>
      <description>This ;void Foo.ajc$before$Foo</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Lint.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Shadow.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/BcelWeaver.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29769" opendate="2003-1-19 11:42:00" fixdate="2003-1-24 21:17:00">
    <buginformation>
      <summary>Ajde does not support new AspectJ 1.1 compiler options</summary>
      <description>The org.aspectj.ajpiler. This enhancement is needed byort.</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/ajde/testdata/examples/figures-coverage/figures/Figure.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/AjdeTests.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/ui/StructureViewManagerTest.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/ajc/BuildArgParser.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/core/builder/AjBuildConfig.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/testsrc/org/aspectj/ajdt/ajc/BuildArgParserTestCase.java</file>
    </fixedFiles>
  </bug>
  <bug id="29959" opendate="2003-1-22 7:10:00" fixdate="2003-2-13 16:00:00">
    <buginformation>
      <summary>super call in intertype method declaration body causes VerifyError</summary>
      <description>AspectJ Compiler 1.1 showstopper</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/compiler/ast/InterTypeConstructorDeclaration.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/SuperFixerVisitor.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/lookup/InterTypeMethodBinding.java</file>
      <file>org.aspectj/modules/tests/bugs/SuperToIntro.java</file>
    </fixedFiles>
  </bug>
</bugrepository>

I would like to extract some elements of the dataset and load them into a pandas DataFrame.

The first problem is to collect all the <file> sub-elements of each <fixedFiles> tag as a list.

Currently my code either retrieves only the first element and ignores the others, or retrieves all of them but without structure. In this first attempt I only get empty lists ([]) without content:

The code :

import pandas as pd 
from xml.etree.ElementTree import parse

document = parse('dataset.xml')
summary = []
description = []
fixedfile = []

for item in document.iterfind('bug'):
    summary.append(item.findtext('buginformation/summary'))
    description.append(item.findtext('buginformation/description'))
    fixedfile.append(item.findall('fixedFiles/file'))
    
#df = pd.DataFrame({'summary':summary, 'description':description, 'fixed_files':fixedfile})
df = pd.DataFrame({'fixed_files': fixedfile})
df

Here I get only the first element:

The code :

import pandas as pd 
from xml.etree.ElementTree import parse

document = parse('dataset.xml')
summary = []
description = []
fixedfile = []

for item in document.iterfind('bug'):
    summary.append(item.findtext('buginformation/summary'))
    description.append(item.findtext('buginformation/description'))
    fixedfile.append(item.findtext('fixedFiles/file'))
    
#df = pd.DataFrame({'summary':summary, 'description':description, 'fixed_files':fixedfile})
df = pd.DataFrame({'fixed_files': fixedfile})
df

I found a solution in "Problem traversing XML tree with Python xml.etree.ElementTree" and adapted it to my case. It works, but not the way I want (one list of files per bug): it loads all the elements, but each one individually.

The code :

import xml.etree.ElementTree as ET
import pandas as pd 

xmldoc = ET.parse('dataset.xml')
root = xmldoc.getroot()
summary = []
description = []
fixedfile = []

for bug in xmldoc.iter(tag='bug'):
    #for item in document.iterfind('bug'):
    #summary.append(item.findtext('buginformation/summary'))
    #description.append(item.findtext('buginformation/description'))
    for file in bug.iterfind('./fixedFiles/file'):
        fixedfile.append([file.text])

fixedfile
#df = pd.DataFrame({'summary':summary, 'description':description, 'fixed_files':fixedfile})
df = pd.DataFrame({'fixed_files': fixedfile})
df

When I try to add the other columns (summary, description) to my dataframe, I get the following error message: ValueError: All arrays must be of the same length

The second problem is to select, for example, all the <fixedFiles> tags that have 2 or 3 sub-elements.

Best regards,

CodePudding user response:

The code below collects the data. The idea is to find all bug elements and iterate over them; for each bug, look up the required sub-elements.

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version="1.0"?>

<bugrepository name="AspectJ">
  <bug id="28974" opendate="2003-1-3 10:28:00" fixdate="2003-1-14 14:30:00">
    <buginformation>
      <summary>"Compiler error when introducing a ""final"" field"</summary>
      <description>The aspecs the problem...</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/AjcMemberMaker.java</file>
    </fixedFiles>
  </bug>

  <bug id="28919" opendate="2002-12-30 16:40:00" fixdate="2003-1-14 15:06:00">
    <buginformation>
      <summary>waever tries to weave into native methods ...</summary>
      <description>If youat org.aspectj.ajdt.internal.core.burce</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29186" opendate="2003-1-8 21:22:00" fixdate="2003-1-14 16:43:00">
    <buginformation>
      <summary>ajc -emacssym chokes on pointcut that includes an intertype method</summary>
      <description>This ;void Foo.ajc$before$Foo</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Lint.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Shadow.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/BcelWeaver.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29769" opendate="2003-1-19 11:42:00" fixdate="2003-1-24 21:17:00">
    <buginformation>
      <summary>Ajde does not support new AspectJ 1.1 compiler options</summary>
      <description>The org.aspectj.ajpiler. This enhancement is needed byort.</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/ajde/testdata/examples/figures-coverage/figures/Figure.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/AjdeTests.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/ui/StructureViewManagerTest.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/ajc/BuildArgParser.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/core/builder/AjBuildConfig.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/testsrc/org/aspectj/ajdt/ajc/BuildArgParserTestCase.java</file>
    </fixedFiles>
  </bug>
  <bug id="29959" opendate="2003-1-22 7:10:00" fixdate="2003-2-13 16:00:00">
    <buginformation>
      <summary>super call in intertype method declaration body causes VerifyError</summary>
      <description>AspectJ Compiler 1.1 showstopper</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/compiler/ast/InterTypeConstructorDeclaration.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/SuperFixerVisitor.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/lookup/InterTypeMethodBinding.java</file>
      <file>org.aspectj/modules/tests/bugs/SuperToIntro.java</file>
    </fixedFiles>
  </bug>
  </bugrepository>'''

data = []
root = ET.fromstring(xml)
for bug in root.findall('.//bug'):
    bug_info = bug.find('buginformation')
    fixed_files = bug.find('fixedFiles')
    # One dict per bug; the file names stay grouped in a list.
    entry = {'summary': bug_info.find('summary').text,
             'description': bug_info.find('description').text,
             'fixedFiles': [x.text for x in fixed_files]}
    data.append(entry)
for entry in data:
    print(entry)
df = pd.DataFrame(data)

output

{'summary': '"Compiler error when introducing a ""final"" field"', 'description': 'The aspecs the problem...', 'fixedFiles': ['org.aspectj/modules/weaver/src/org/aspectj/weaver/AjcMemberMaker.java']}
{'summary': 'waever tries to weave into native methods ...', 'description': 'If youat org.aspectj.ajdt.internal.core.burce', 'fixedFiles': ['org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java']}
{'summary': 'ajc -emacssym chokes on pointcut that includes an intertype method', 'description': 'This ;void Foo.ajc$before$Foo', 'fixedFiles': ['org.aspectj/modules/weaver/src/org/aspectj/weaver/Lint.java', 'org.aspectj/modules/weaver/src/org/aspectj/weaver/Shadow.java', 'org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/BcelWeaver.java']}
{'summary': 'Ajde does not support new AspectJ 1.1 compiler options', 'description': 'The org.aspectj.ajpiler. This enhancement is needed byort.', 'fixedFiles': ['org.aspectj/modules/ajde/testdata/examples/figures-coverage/figures/Figure.java', 'org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/AjdeTests.java', 'org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/ui/StructureViewManagerTest.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/ajc/BuildArgParser.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/core/builder/AjBuildConfig.java', 'org.aspectj/modules/org.aspectj.ajdt.core/testsrc/org/aspectj/ajdt/ajc/BuildArgParserTestCase.java']}
{'summary': 'super call in intertype method declaration body causes VerifyError', 'description': 'AspectJ Compiler 1.1 showstopper', 'fixedFiles': ['org.aspectj/modules/org.aspectj.ajdt.core/src/org/compiler/ast/InterTypeConstructorDeclaration.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/SuperFixerVisitor.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/lookup/InterTypeMethodBinding.java', 'org.aspectj/modules/tests/bugs/SuperToIntro.java']}
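As a side note, the ValueError from the question most likely comes from a length mismatch: the flattened loop produces one entry per file, while summary and description have one entry per bug. The sketch below illustrates this with a hypothetical two-bug sample (the ids, texts, and file names are made up for illustration, not taken from the dataset):

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample with the same structure as the dataset.
SAMPLE = """<bugrepository name="AspectJ">
  <bug id="1">
    <buginformation><summary>s1</summary><description>d1</description></buginformation>
    <fixedFiles><file>A.java</file></fixedFiles>
  </bug>
  <bug id="2">
    <buginformation><summary>s2</summary><description>d2</description></buginformation>
    <fixedFiles><file>B.java</file><file>C.java</file></fixedFiles>
  </bug>
</bugrepository>"""

root = ET.fromstring(SAMPLE)

# One summary per <bug>.
summaries = [bug.findtext('buginformation/summary') for bug in root.iter('bug')]

# Flattened: one entry per <file>, regardless of which bug it belongs to.
flat_files = [[f.text] for f in root.iter('file')]

# Grouped: one list of file names per <bug>.
grouped_files = [[f.text for f in bug.iterfind('fixedFiles/file')]
                 for bug in root.iter('bug')]

print(len(summaries), len(flat_files), len(grouped_files))  # 2 3 2
```

Only the grouped version has the same length as the summaries, so only that one can be used as a column next to them.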

CodePudding user response:

To keep the files associated with each bug's summary and description, collect them into a new list for each bug.

Try:

import pandas as pd
from xml.etree.ElementTree import parse

document = parse('dataset.xml')
summary = []
description = []
fixedfile = []

for item in document.iterfind('bug'):
    summary.append(item.findtext('buginformation/summary'))
    description.append(item.findtext('buginformation/description'))
    fixedfile.append([elt.text for elt in item.findall('fixedFiles/file')])

df = pd.DataFrame({'summary': summary,
                   'description': description,
                   'fixed_files': fixedfile})
df
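To see why the two attempts in the question behave differently, here is a minimal sketch on a hypothetical one-bug sample (the file names are made up). It compares what findtext, findall, and the list comprehension return:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one bug with two <file> children.
SAMPLE = """<bugrepository>
  <bug id="1">
    <fixedFiles><file>A.java</file><file>B.java</file></fixedFiles>
  </bug>
</bugrepository>"""

bug = ET.fromstring(SAMPLE).find('bug')

# findtext returns the text of the first matching element only.
first_only = bug.findtext('fixedFiles/file')

# findall returns a list of Element objects, not their text.
elements = bug.findall('fixedFiles/file')

# Taking .text of each element gives the list of strings.
texts = [elt.text for elt in elements]

print(first_only)  # A.java
print(texts)       # ['A.java', 'B.java']
```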

For the second part, this keeps only the bugs with two or more files:

newdf = df[df.fixed_files.str.len() >= 2]

If you want bugs with exactly 2 or 3 files:

newdf = df[(df.fixed_files.str.len() == 2) | (df.fixed_files.str.len() == 3)]
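The same filter can be written a bit more compactly. This is only a sketch on a hypothetical small DataFrame (the values are made up); it computes the list lengths once and then uses Series.between or Series.isin:

```python
import pandas as pd

# Hypothetical data with the same column layout as above.
df = pd.DataFrame({
    'summary': ['s1', 's2', 's3', 's4'],
    'fixed_files': [
        ['A.java'],
        ['B.java', 'C.java'],
        ['D.java', 'E.java', 'F.java'],
        ['G.java', 'H.java', 'I.java', 'J.java'],
    ],
})

# Number of files per bug; .str.len() also works on list values.
n_files = df['fixed_files'].str.len()

# between() is inclusive on both ends by default.
two_or_three = df[n_files.between(2, 3)]

# isin() is handy when the wanted counts are not a contiguous range.
same_result = df[n_files.isin([2, 3])]
```

Both selections keep the rows for s2 and s3 here.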