Home > Software design >  How to display rows correctly while using pandas.read_xml()?
How to display rows correctly while using pandas.read_xml()?

Time:02-15

I have a xml file in this format :

<data>
  <body>
    <ID>1</ID>
    <SomeNestedElement>
      <SomeOtherNestedElement>
        <SomeDeepNestedElement1>Something1</SomeDeepNestedElement1>
        <SomeDeepNestedElement2>Something2</SomeDeepNestedElement2>
        <SomeDeepNestedElement3>Something3</SomeDeepNestedElement3>
      </SomeOtherNestedElement>
      <SomeOtherNestedElement1>
        <SomeDeepNestedElement1>Something4</SomeDeepNestedElement1>
        <SomeDeepNestedElement2>Something5</SomeDeepNestedElement2>
        <SomeDeepNestedElement3>Something3</SomeDeepNestedElement3>
      </SomeOtherNestedElement1>
    </SomeNestedElement>
  </body>
  <body>
    <ID>2</ID>
    <SomeNestedElement>
      <SomeOtherNestedElement>
        <SomeDeepNestedElement1>Something1A</SomeDeepNestedElement1>
        <SomeDeepNestedElement2>Something2A</SomeDeepNestedElement2>
        <SomeDeepNestedElement3>Something3A</SomeDeepNestedElement3>
      </SomeOtherNestedElement>
      <SomeOtherNestedElement1>
        <SomeDeepNestedElement1>Something4A</SomeDeepNestedElement1>
        <SomeDeepNestedElement2>Something5A</SomeDeepNestedElement2>
        <SomeDeepNestedElement3>Something6A</SomeDeepNestedElement3>
      </SomeOtherNestedElement1>
    </SomeNestedElement>
  </body>

When I use : pandas.read_xml('data.xml',xpath='data/body/SomeNestedElement/SomeOtherNestedElement'), I get the dataframe as follows :

SomeDeepNestedElement1 SomeDeepNestedElement2 SomeDeepNestedElement3
0 Something1 Something2 Something3
1 Something4 Something5 Something6
2 Something1A Something2A Something3A
3 Something4A Something5A Something6A

Now, I also want to add ID to it, how do I do it?

If I use : pandas.read_xml('data.xml',xpath='data/body/ID | data/body/SomeNestedElement/SomeOtherNestedElement'), I get this :

ID SomeDeepNestedElement1 SomeDeepNestedElement2 SomeDeepNestedElement3
0 1 None None None
1 None Something1 Something2 Something3
2 None Something4 Something5 Something6
3 2 None None None
4 None Something1A Something2A Something3A
5 None Something4A Something5A Something6A

This is expected result:

ID SomeDeepNestedElement1 SomeDeepNestedElement2 SomeDeepNestedElement3
0 1 Something1 Something2 Something3
1 1 Something4 Something5 Something6
2 2 Something1A Something2A Something3A
3 2 Something4A Something5A Something6A

I have seen a solution of doing pandas.read_xml().join(pandas.read_xml()). I tried: pandas.read_xml('data.xml',xpath='data/body/ID').join(pandas.read_xml('data.xml',xpath='data/body/SomeNestedElement/SomeOtherNestedElement'))

But this raises an error because pandas.read_xml should return more than one element and pandas.read_xml('data.xml',xpath='data/body/ID') returns only one.

I also tried creating a dummy column and dropping it, like:

pandas.read_xml('data.xml',xpath='data/body/ID|data/body/dummy_column').drop(columns='dummy_column').join(pandas.read_xml('data.xml',xpath='data/body/SomeNestedElement/SomeOtherNestedElement'))

This is the best I could achieve so far with the above part:

ID SomeDeepNestedElement1 SomeDeepNestedElement2 SomeDeepNestedElement3
0 1 Something1 Something2 Something3
1 None Something4 Something5 Something6
2 2 Something1A Something2A Something3A
3 None Something4A Something5A Something6A

This is not perfect and also it uses two pandas.read_xml functions which is quite expensive assuming that I work on thousands of data points.

How to get the IDs for rest of the rows as well to match the expected result like this and if possible use only one pandas.read_xml()?

ID SomeDeepNestedElement1 SomeDeepNestedElement2 SomeDeepNestedElement3
0 1 Something1 Something2 Something3
1 1 Something4 Something5 Something6
2 2 Something1A Something2A Something3A
3 2 Something4A Something5A Something6A

CodePudding user response:

From your second table, use:

df = pd.read_xml('data.xml', '(.//ID|.//SomeNestedElement/*)')
df = df.assign(ID=df['ID'].ffill().astype(int)).loc[df['ID'].isna()]

Output:

ID SomeDeepNestedElement1 SomeDeepNestedElement2 SomeDeepNestedElement3
1 Something1 Something2 Something3
1 Something4 Something5 Something3
2 Something1A Something2A Something3A
2 Something4A Something5A Something6A
  • Related