Home > Mobile >  parse xml files in root folder and its sub folders
parse xml files in root folder and its sub folders

Time:10-09

I'm working on a directory with 27 folders and various XML files inside each folder.

was able to:

  • parse one XML file and write to a CSV file
  • traverse one folder and read and parse all XML files in it

challenges:

  • having a problem trying to go through and parse all the XML files from the root folder to its subfolders

Send help and thank you. Code snippets below

# working in one folder only
import csv
import xml.etree.ElementTree as ET
import os

## directory
path = "/Users.../y"
filenames = []

## Count the number of xml files of each folder
files = os.listdir(path)
print("\n")


xml_data_to_csv = open('/Users.../xml_extract.csv', 'w')
list_head = []
csvwriter = csv.writer(xml_data_to_csv)

# Read XML files in a folder
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path,filename)
    print("\n", fullname)
    filenames.append(fullname)

# parse elements in each XML file
for filename in filenames:
    tree = ET.parse(filename)
    root = tree.getroot()

    extract_xml=[]

    ## extract child elements per xml file
    print("\n")
    for x in root.iter('Info'):
        for element in x:
            print(element.tag,element.text)
            extract_xml.append(element.text)


    ## Write list nodes to csv
    csvwriter.writerow(extract_xml)

## Close CSV file
xml_data_to_csv.close()

enter image description here

CodePudding user response:

You can use os.walk:

import os
for dir_name, dirs, files in os.walk('<root_dir>'):
    # parse files

CodePudding user response:

You can get the list of all the XML files in a given path with

import os

path = "main/root"
filelist = []

for root, dirs, files in os.walk(path):
    for file in files:
        if not file.endswith('.xml'):
            continue
        filelist.append(os.path.join(root, file))

for file in filelist:
    print(file)
    # or in your case parse the XML 'file'

If for instance:

$ tree /main/root
/main/root
├── a
│   ├── a.xml
│   ├── b.xml
│   └── c.xml
├── b
│   ├── d.xml
│   ├── e.xml
│   └── x.txt
└── c
    ├── f.xml
    └── g.xml

we get:

/main/root/c/g.xml
/main/root/c/f.xml
/main/root/b/e.xml
/main/root/b/d.xml
/main/root/a/c.xml
/main/root/a/b.xml
/main/root/a/a.xml

If you want to sort directories and files:

for root, dirs, files in os.walk(path):
    dirs.sort()
    for file in sorted(files):
        if not file.endswith('.xml'):
            continue
        filelist.append(os.path.join(root, file))

CodePudding user response:

You can use the pathlib module to "glob" the XML files. It will search all subdirectories for the pattern you supply and return Path objects that already include the path to the file. Cleaning up your script a bit, you would have

import csv
import xml.etree.ElementTree as ET
from pathlib import Path

## directory
path = Path("/Users.../y")

with open('/Users.../xml_extract.csv', 'w') as xml_data_to_csv:
    csvwriter = csv.writer(xml_data_to_csv)

    # Read XML files in a folder
    for filepath in path.glob("**/*.xml"):
        tree = ET.parse(filename)
        root = tree.getroot()

        extract_xml=[]

        ## extract child elements per xml file
        print("\n")
        for x in root.iter('Info'):
            for element in x:
                print(element.tag,element.text)
                extract_xml.append(element.text)

        ## Write list nodes to csv
        csvwriter.writerow(extract_xml)
  • Related