Processing files with listdir() breaks when the directory contains subdirectories

The following code should walk through a directory, pick up the XML files, and process them (it prefixes HTML classes stored in XML elements, but that is not important for the question). The code works as long as there are no subdirectories inside "./input-dir", but as soon as there are subdirectories, it raises an error:

Traceback (most recent call last):
  File "/Users/ab/Code/SHCprefixer-2022/shc-prefixer_upwork.py", line 22, in <module>
    content = file.readlines();
  File "/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 566: invalid start byte

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import os
import lxml
import re

input_path = "./input-dir";
output_path = "./output-dir";

ls = os.listdir(input_path);
print(ls);

with open("classes.txt", "r") as cls:
    clss = cls.readlines()
    for i in range(len(clss)):
        clss[i] = clss[i].strip()
print(clss);

for d in range(len(ls)):

    with open(f"{input_path}/{ls[d]}", "r") as file:
        content = file.readlines();
        content = "".join(content)
        bs_content = BeautifulSoup(content, "lxml")
        str_bs_content = str(bs_content)
        str_bs_content = str_bs_content.replace("""<?xml version="1.0" encoding="UTF-8"?><html><body>""", "");
        str_bs_content = str_bs_content.replace("</body></html>", "");
        for j in range(len(clss)):
            str_bs_content = str_bs_content.replace(clss[j], f"prefix-{clss[j]}")
    with open(f"{output_path}/{ls[d]}", "w") as f:
        f.write(str_bs_content)

The error is probably related to the listdir() call. As suggested in "IsADirectoryError: [Errno 21] Is a directory: " It is a file", I should use os.walk(), but I wasn't able to implement it. It would be great if someone could help.

CodePudding user response：

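It looks like you need to filter directories out of the input directory listing. You can use os.path.isfile() to check each entry. Because os.listdir() returns bare names rather than paths, join each name with input_path before testing it. With a list comprehension you get the filtered list in one line:

ls = [f for f in os.listdir(input_path) if os.path.isfile(os.path.join(input_path, f))]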
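
The filtering idea can be sketched as a small helper. The traceback in the question is a UnicodeDecodeError rather than an IsADirectoryError, which suggests that a non-text file was read, so this sketch also keeps only names ending in .xml. The helper name list_xml_files and the extension check are assumptions added for illustration; they are not part of the answer above.

```python
import os


def list_xml_files(directory):
    """Return the names of regular .xml files directly inside directory."""
    names = []
    for name in sorted(os.listdir(directory)):
        # os.listdir() yields bare names, so build the full path before testing.
        full_path = os.path.join(directory, name)
        # Skip subdirectories and anything that is not an XML file.
        if os.path.isfile(full_path) and name.lower().endswith(".xml"):
            names.append(name)
    return names
```

With this helper, ls = list_xml_files(input_path) could replace the os.listdir() call in the question. It does not descend into subdirectories; it only skips them.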

CodePudding user response：

You need to test whether each returned file system name is a file. You also want to search the entire subtree. Instead of listdir you could use os.walk, but I think the newer pathlib module better suits your needs. Its .glob method, when used with "**", searches the subtree and filters for a known file extension at the same time.

from pathlib import Path

from bs4 import BeautifulSoup

input_path = Path("./input-dir")
output_path = Path("./output-dir")

ls = [p for p in input_path.glob("**/*.xml") if p.is_file()]
print(", ".join(str(p) for p in ls))

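# Read the class names to prefix, one per line. Blank lines are skipped,
# because replacing an empty string would insert the prefix between every character.
with open("classes.txt", "r") as cls:
    clss = [line.strip() for line in cls if line.strip()]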
print(clss)

for infile in ls:
    # Parse the file with the lxml parser and serialise it back to a string.
    with infile.open() as file:
        bs_content = BeautifulSoup(file.read(), "lxml")
    str_bs_content = str(bs_content)
    # Strip the XML declaration and the <html><body> wrapper that the lxml parser adds.
    str_bs_content = str_bs_content.replace("""<?xml version="1.0" encoding="UTF-8"?><html><body>""", "")
    str_bs_content = str_bs_content.replace("</body></html>", "")
    # Prefix every known class name.
    for name in clss:
        str_bs_content = str_bs_content.replace(name, f"prefix-{name}")
    # Mirror the input subtree under output_path, creating directories as needed.
    outfile = output_path / infile.relative_to(input_path)
    outfile.parent.mkdir(parents=True, exist_ok=True)
    with outfile.open("w") as f:
        f.write(str_bs_content)
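
For comparison, the traversal can also be written with os.walk, which the question asked about. This is only a sketch: the generator name iter_xml_pairs is hypothetical, and it only pairs each input file with a mirrored output path, leaving the BeautifulSoup processing to the caller.

```python
import os


def iter_xml_pairs(input_dir, output_dir):
    """Yield (source, destination) paths for every .xml file under input_dir."""
    for root, _dirs, files in os.walk(input_dir):
        # Directory of the current files relative to the input root,
        # so that the output tree mirrors the input tree.
        rel_root = os.path.relpath(root, input_dir)
        for name in sorted(files):
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an XML file
            src = os.path.join(root, name)
            dst = os.path.normpath(os.path.join(output_dir, rel_root, name))
            yield src, dst
```

Each destination's parent directory still has to exist before writing, for example via os.makedirs(os.path.dirname(dst), exist_ok=True). os.walk only reports regular entries in its files list, so no separate is-file test is needed here.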