I have a xml file in this format :
<data>
<body>
<ID>1</ID>
<SomeNestedElement>
<SomeOtherNestedElement>
<SomeDeepNestedElement1>Something1</SomeDeepNestedElement1>
<SomeDeepNestedElement2>Something2</SomeDeepNestedElement2>
<SomeDeepNestedElement3>Something3</SomeDeepNestedElement3>
</SomeOtherNestedElement>
<SomeOtherNestedElement1>
<SomeDeepNestedElement1>Something4</SomeDeepNestedElement1>
<SomeDeepNestedElement2>Something5</SomeDeepNestedElement2>
<SomeDeepNestedElement3>Something3</SomeDeepNestedElement3>
</SomeOtherNestedElement1>
</SomeNestedElement>
</body>
<body>
<ID>2</ID>
<SomeNestedElement>
<SomeOtherNestedElement>
<SomeDeepNestedElement1>Something1A</SomeDeepNestedElement1>
<SomeDeepNestedElement2>Something2A</SomeDeepNestedElement2>
<SomeDeepNestedElement3>Something3A</SomeDeepNestedElement3>
</SomeOtherNestedElement>
<SomeOtherNestedElement1>
<SomeDeepNestedElement1>Something4A</SomeDeepNestedElement1>
<SomeDeepNestedElement2>Something5A</SomeDeepNestedElement2>
<SomeDeepNestedElement3>Something6A</SomeDeepNestedElement3>
</SomeOtherNestedElement1>
</SomeNestedElement>
</body>
When I use : pandas.read_xml('data.xml',xpath='data/body/SomeNestedElement/SomeOtherNestedElement')
, I get the dataframe as follows :
SomeDeepNestedElement1 | SomeDeepNestedElement2 | SomeDeepNestedElement3 | |
---|---|---|---|
0 | Something1 | Something2 | Something3 |
1 | Something4 | Something5 | Something6 |
2 | Something1A | Something2A | Something3A |
3 | Something4A | Something5A | Something6A |
Now, I also want to add ID to it, how do I do it?
If I use : pandas.read_xml('data.xml',xpath='data/body/ID | data/body/SomeNestedElement/SomeOtherNestedElement')
,
I get this :
ID | SomeDeepNestedElement1 | SomeDeepNestedElement2 | SomeDeepNestedElement3 | |
---|---|---|---|---|
0 | 1 | None | None | None |
1 | None | Something1 | Something2 | Something3 |
2 | None | Something4 | Something5 | Something6 |
3 | 2 | None | None | None |
4 | None | Something1A | Something2A | Something3A |
5 | None | Something4A | Something5A | Something6A |
This is expected result:
ID | SomeDeepNestedElement1 | SomeDeepNestedElement2 | SomeDeepNestedElement3 | |
---|---|---|---|---|
0 | 1 | Something1 | Something2 | Something3 |
1 | 1 | Something4 | Something5 | Something6 |
2 | 2 | Something1A | Something2A | Something3A |
3 | 2 | Something4A | Something5A | Something6A |
I have seen a solution of doing pandas.read_xml().join(pandas.read_xml()). I tried: pandas.read_xml('data.xml',xpath='data/body/ID').join(pandas.read_xml('data.xml',xpath='data/body/SomeNestedElement/SomeOtherNestedElement'))
But this raises an error because pandas.read_xml should return more than one element and pandas.read_xml('data.xml',xpath='data/body/ID')
returns only one.
I also tried creating a dummy column and dropping it, like:
pandas.read_xml('data.xml',xpath='data/body/ID|data/body/dummy_column').drop(columns='dummy_column').join(pandas.read_xml('data.xml',xpath='data/body/SomeNestedElement/SomeOtherNestedElement'))
This is the best I could achieve so far with the above part:
ID | SomeDeepNestedElement1 | SomeDeepNestedElement2 | SomeDeepNestedElement3 | |
---|---|---|---|---|
0 | 1 | Something1 | Something2 | Something3 |
1 | None | Something4 | Something5 | Something6 |
2 | 2 | Something1A | Something2A | Something3A |
3 | None | Something4A | Something5A | Something6A |
This is not perfect and also it uses two pandas.read_xml functions which is quite expensive assuming that I work on thousands of data points.
How to get the IDs for rest of the rows as well to match the expected result like this and if possible use only one pandas.read_xml()?
ID | SomeDeepNestedElement1 | SomeDeepNestedElement2 | SomeDeepNestedElement3 | |
---|---|---|---|---|
0 | 1 | Something1 | Something2 | Something3 |
1 | 1 | Something4 | Something5 | Something6 |
2 | 2 | Something1A | Something2A | Something3A |
3 | 2 | Something4A | Something5A | Something6A |
CodePudding user response:
From your second table, use:
df = pd.read_xml('data.xml', '(.//ID|.//SomeNestedElement/*)')
df = df.assign(ID=df['ID'].ffill().astype(int)).loc[df['ID'].isna()]
Output:
ID | SomeDeepNestedElement1 | SomeDeepNestedElement2 | SomeDeepNestedElement3 |
---|---|---|---|
1 | Something1 | Something2 | Something3 |
1 | Something4 | Something5 | Something3 |
2 | Something1A | Something2A | Something3A |
2 | Something4A | Something5A | Something6A |