Removing multiple XML declarations from a document

Time:01-04

I have a file that has multiple XML declarations.

<?xml version="1.0" encoding="UTF-8"?>

I am currently reading each file as text and writing every line that is not an XML declaration to a new file. I have many such files, and this takes around 20 minutes per file. Is there a faster or simpler way to do this?

I am using Python. The files are on my laptop, and each one is around 11 million lines (about 450 MB).

My code for iterating through the file and removing the declarations is below.

month_file = "2015-01.nml.txt"

delete_lines = [
        '<?xml version="1.0" encoding="ISO-8859-1" ?>',
        '<?xml version="1.0" encoding="iso-8859-1" ?>',
        '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">',    
    ] 
       
                   
with open(month_file, encoding="ISO-8859-1") as in_fh:
    while True:
        line = in_fh.readline()
        if not line: break

        if any(x in line for x in delete_lines):
            continue 
        else:
            out_fh = open('myfile_faster.xml', "a")
            out_fh.write(line)        
    out_fh.close()

CodePudding user response:

This is essentially the same as your version, but it opens the input and output files only once, uses a single if condition, and writes to the output as it iterates through the input (somewhat like sed).

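in_file = "2015-01.nml.txt"
out_file = "myfile_faster.xml"

# Open both files once, with the same encoding the question reads with.
with open(in_file, encoding="ISO-8859-1") as f_in, \
        open(out_file, mode="w", encoding="ISO-8859-1") as f_out:
    for line in f_in:
        # Skip XML declarations and DOCTYPE lines; copy everything else.
        if line.startswith(("<?xml", "<!DOCTYPE")):
            continue
        f_out.write(line)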
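
Since the question mentions many such files, the same loop could be wrapped in a function and applied to a whole folder. The following is only a sketch under some assumptions: the names strip_declarations and strip_all, the "*.nml.txt" pattern, and the ".clean.xml" output suffix are made up for illustration. It reads and writes in binary mode, which avoids decoding every line and leaves the bytes unchanged; that assumes an ASCII-compatible encoding such as ISO-8859-1 or UTF-8.

```python
from pathlib import Path

# Prefixes of the lines to drop; bytes, because the files are read in binary mode.
SKIP_PREFIXES = (b"<?xml", b"<!DOCTYPE")


def strip_declarations(src, dst):
    """Copy src to dst, skipping XML declaration and DOCTYPE lines.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        for line in f_in:
            # lstrip() tolerates indentation before the declaration.
            if line.lstrip().startswith(SKIP_PREFIXES):
                dropped += 1
                continue
            f_out.write(line)
    return dropped


def strip_all(folder, pattern="*.nml.txt"):
    """Clean every matching file in folder, writing <name>.clean.xml next to it."""
    results = {}
    for src in sorted(Path(folder).glob(pattern)):
        # Hypothetical naming scheme: keep the original name and append a suffix.
        dst = src.with_name(src.name + ".clean.xml")
        results[src.name] = strip_declarations(src, dst)
    return results
```

Calling strip_all(".") would process every matching file in the current directory and return a mapping from file name to the number of removed lines, which gives a quick sanity check that each file lost the expected number of declarations.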