I have a number of XML files whose format is:
<objects>
<object>
<record>
<invoice_source>EMAIL</invoice_source>
<invoice_capture_date>2022-11-18</invoice_capture_date>
<document_type>INVOICE</document_type>
<data_capture_provider_code>00001</data_capture_provider_code>
<data_capture_provider_reference>1264</data_capture_provider_reference>
<document_capture_provide_code>00002</document_capture_provide_code>
<document_capture_provider_ref>1264</document_capture_provider_ref>
<rows/>
</record>
</object>
</objects>
There are two wrapper elements in this XML, <objects> and <object>. I want to remove one of them so that the XML looks like this:
<objects>
<record>
<invoice_source>EMAIL</invoice_source>
<invoice_capture_date>2022-11-18</invoice_capture_date>
<document_type>INVOICE</document_type>
<data_capture_provider_code>00001</data_capture_provider_code>
<data_capture_provider_reference>1264</data_capture_provider_reference>
<document_capture_provide_code>00002</document_capture_provide_code>
<document_capture_provider_ref>1264</document_capture_provider_ref>
<rows/>
</record>
</objects>
I have a folder full of these files and want to do this using Python. Is there a way?
CodePudding user response:
The direct way is shown below. If your real files are more complicated than one-object/one-record you'll have to be more specific with examples:
from xml.etree import ElementTree as et
xml = '''\
<objects>
<object>
<record>
<invoice_source>EMAIL</invoice_source>
<invoice_capture_date>2022-11-18</invoice_capture_date>
<document_type>INVOICE</document_type>
<data_capture_provider_code>00001</data_capture_provider_code>
<data_capture_provider_reference>1264</data_capture_provider_reference>
<document_capture_provide_code>00002</document_capture_provide_code>
<document_capture_provider_ref>1264</document_capture_provider_ref>
<rows/>
</record>
</object>
</objects>
'''
objects = et.fromstring(xml)
objects.append(objects[0][0]) # move "record" out of "object" and append as child to "objects"
objects.remove(objects[0]) # remove empty "object"
et.indent(objects) # reformat indentation (Python 3.9+)
et.dump(objects) # show result
Output:
<objects>
<record>
<invoice_source>EMAIL</invoice_source>
<invoice_capture_date>2022-11-18</invoice_capture_date>
<document_type>INVOICE</document_type>
<data_capture_provider_code>00001</data_capture_provider_code>
<data_capture_provider_reference>1264</data_capture_provider_reference>
<document_capture_provide_code>00002</document_capture_provide_code>
<document_capture_provider_ref>1264</document_capture_provider_ref>
<rows />
</record>
</objects>
Another option that would handle any nested content in object:
objects = et.fromstring(xml)
objects = objects[0] # extract "object" (lose "objects" layer)
objects.tag = 'objects' # rename "object" tag
et.indent(objects) # reformat indentation (Python 3.9+)
et.dump(objects) # show result (same output)
CodePudding user response:
My approach is to iterate over the children of <objects>, which are the <object> nodes, and move their <record> nodes up one level. After that, I can remove the <object> nodes.
import xml.etree.ElementTree as ET
doc = ET.parse("input.xml")
objects = doc.getroot()
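for obj in list(objects):  # iterate over a copy, since objects is modified in the loop
    for record in list(obj):
        objects.append(record)  # move "record" up one level
    objects.remove(obj)  # remove the now-redundant "object"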
doc.write("output.xml")
Here is the contents of output.xml:
<objects>
<record>
<invoice_source>EMAIL</invoice_source>
<invoice_capture_date>2022-11-18</invoice_capture_date>
<document_type>INVOICE</document_type>
<data_capture_provider_code>00001</data_capture_provider_code>
<data_capture_provider_reference>1264</data_capture_provider_reference>
<document_capture_provide_code>00002</document_capture_provide_code>
<document_capture_provider_ref>1264</document_capture_provider_ref>
<rows />
</record>
</objects>
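Since the question mentions a whole folder of files, here is a minimal sketch of how the same idea could be applied to every file in a directory. The function names unwrap_objects and convert_folder, and the choice to write results into a separate output folder rather than overwrite the inputs, are assumptions for illustration and not part of either answer above. It also assumes every file has an <objects> root with <object> wrappers as direct children.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def unwrap_objects(root):
    """Move the children of each <object> wrapper directly under root."""
    for obj in list(root.findall("object")):  # copy: root is mutated below
        index = list(root).index(obj)  # remember where the wrapper sat
        root.remove(obj)
        for offset, child in enumerate(list(obj)):
            root.insert(index + offset, child)  # keep original order
    return root


def convert_folder(in_dir, out_dir):
    """Rewrite every *.xml file found in in_dir into out_dir.

    Both folder arguments are hypothetical; adjust them to your layout.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    converted = []
    for xml_file in sorted(Path(in_dir).glob("*.xml")):
        tree = ET.parse(xml_file)
        unwrap_objects(tree.getroot())
        ET.indent(tree)  # reformat indentation (Python 3.9+)
        target = out_path / xml_file.name
        tree.write(target, encoding="utf-8", xml_declaration=True)
        converted.append(target)
    return converted
```

Calling convert_folder("xml_in", "xml_out") (folder names made up here) would leave the originals untouched, which makes it easy to compare a few files before trusting the result.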