Rearrange the values in a file in Python

I have a very large file of about 900k values. It consists of repeated blocks like this:

/begin throw
    COLOR red
     DESCRIPTION
     "cashmere sofa throw"
      10
      10
      156876
     DIMENSION
      140
      200
    STORE_ADDRESS 59110
/end throw

The values change from block to block, but I need each block laid out like this:

    /begin throw
     STORE_ADDRESS 59110
        COLOR red
         DESCRIPTION "cashmere sofa throw" 10 10 156876
         DIMENSION 140 200
    /end throw

Currently, my approach is to remove the newlines and join the values with spaces. The store address is constant throughout the file, so I thought of removing it at its index and inserting it before the description:

text_file = open(filename, 'r')
filedata = text_file.readlines();

for num,line in enumerate(filedata,0):
    if '/begin' in line:
        for index in range(num, len(filedata)):
            if "store_address 59110 " in filedata[index]:
                    filedata.remove(filedata[index])
                    filedata.insert(filedata[index-7])
                    break
                  
            if "DESCRIPTION" in filedata[index]:
                try:
                    filedata[index] = filedata[index].replace("\n", " ")
                    filedata[index + 1] = filedata[index + 1].replace(" ","").replace("\n", " ")
                    filedata[index + 2] = filedata[index + 2].replace(" ","").replace("\n", " ")
                    filedata[index + 3] = filedata[index + 3].replace(" ","").replace("\n", " ")
                    filedata[index + 4] = filedata[index + 4].replace(" ","").replace("\n", " ")
                    filedata[index + 5] = filedata[index + 5].replace(" ","").replace("\n", " ")
                    filedata[index + 6] = filedata[index + 6].replace(" ","").replace("\n", " ")
                    filedata[index + 7] = filedata[index + 7].replace(" ","").replace("\n", " ")
                    filedata[index + 8] = filedata[index + 8].replace(" ","")
                except IndexError:
                    print("Error Index DESCRIPTION:", index, num)
                
            if "DIMENSION" in filedata[index]:
                try:
                    filedata[index] = filedata[index].replace("\n", " ")
                    filedata[index + 1] = filedata[index + 1].replace(" ","").replace("\n", " ")
                    filedata[index + 2] = filedata[index + 2].replace(" ","").replace("\n", " ")
                    filedata[index + 3] = filedata[index + 3].replace(" ","")
                except IndexError:
                    print("Error Index DIMENSION:", index, num)

After that, I write filedata to another file.

This approach takes too long to run (almost an hour and a half) because, as mentioned earlier, the file is large. Is there a faster way to do this?

CodePudding user response：

You can read the file structure by structure so that you don't have to store the whole content in memory and manipulate it there. By structure, I mean all the values between and including /begin throw and /end throw. This should be much faster.

def rearrange_structure_and_write_into_file(structure, output_file):
    # TODO: rearrange the elements in structure and write the result into output_file
    pass

current_structure = ""
with open(filename, 'r') as original_file:
    with open(output_filename, 'w') as output_file:
        for line in original_file:
            # Accumulate lines until the closing marker of the current structure
            current_structure += line
            if "/end throw" in line:
                rearrange_structure_and_write_into_file(current_structure, output_file)
                current_structure = ""
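
The TODO in the snippet above could be filled in along the following lines. This is only a sketch under assumptions taken from the sample in the question: every structure uses the four keywords shown there, a keyword's values either share its line or follow on the next lines, and the output does not need the original indentation. The names `KEYWORD_ORDER` and `rearrange_structure` are hypothetical helpers introduced here for illustration.

```python
# Hypothetical output order (an assumption): STORE_ADDRESS moves to the front.
KEYWORD_ORDER = ["STORE_ADDRESS", "COLOR", "DESCRIPTION", "DIMENSION"]


def rearrange_structure(structure):
    """Return one '/begin throw' ... '/end throw' block in the new layout."""
    fields = {}      # keyword -> list of value strings, in order of appearance
    current = None   # value list of the keyword currently being filled
    for raw in structure.splitlines():
        line = raw.strip()
        if not line or line in ("/begin throw", "/end throw"):
            continue
        keyword, _, rest = line.partition(" ")
        if keyword in KEYWORD_ORDER:
            # Start a new field; keep an inline value if there is one.
            current = fields[keyword] = [rest] if rest else []
        elif current is not None:
            # Continuation line: belongs to the most recent keyword.
            current.append(line)
    out = ["/begin throw"]
    for keyword in KEYWORD_ORDER:
        if keyword in fields:
            out.append(" ".join([keyword] + fields[keyword]))
    out.append("/end throw")
    return "\n".join(out) + "\n"


def rearrange_structure_and_write_into_file(structure, output_file):
    output_file.write(rearrange_structure(structure))
```

Keeping the rearranging in a pure function that returns a string makes it easy to check against one hand-written block before running it over the whole file.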

CodePudding user response：

Inserting into and removing from a long list is likely what makes this code slower than it needs to be; it also makes the code error-prone and difficult to reason about. If any entry lacks a store_address line, the code will not work correctly: it will keep searching through the following entries until it finds one.
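
To illustrate one such pitfall, here is a small sketch. It assumes the store-address lines are textually identical, since the question says the address is constant. `list.remove(x)` searches from the start of the list and deletes the first element equal to `x`, which is not necessarily the one at the index you were looking at. `del lst[i]` deletes by position instead. Either way, the elements after the deleted one have to be shifted, so each call costs time proportional to the list length.

```python
# Three entries that share the same store-address line, as in the question.
lines = ["A", "STORE_ADDRESS 59110",
         "B", "STORE_ADDRESS 59110",
         "C", "STORE_ADDRESS 59110"]

# Intent: remove the store-address line at index 3 (the second entry).
by_value = list(lines)
by_value.remove(by_value[3])  # removes the first equal item, which is at index 1

by_index = list(lines)
del by_index[3]               # removes exactly the element at index 3

print(by_value)
print(by_index)
```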

A better approach would be to break down the code into functions that parse each entry and output it:

KEYWORDS = ["STORE_ADDRESS", "COLOR", "DESCRIPTION", "DIMENSION"]
    
def parse_lines(lines):
    """ Parse throw data from lines in the old format """
    current_section = None
    r = {}
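    for line in lines:
        words = line.strip().split(" ")
        if words[0] in KEYWORDS:
            if words[1:]:
                # Inline value, e.g. "COLOR red": keep everything after the keyword
                r[words[0]] = " ".join(words[1:])
                current_section = None
            else:
                # Bare keyword, e.g. "DESCRIPTION": its values follow on later lines
                current_section = r[words[0]] = []
        elif current_section is not None and line.strip():
            current_section.append(line.strip())
    return r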

def output_throw(throw):
    """ Output a throw entry as lines of text in the new format """ 
    yield "/begin throw"
    for keyword in KEYWORDS:
        if keyword in throw:
            value = throw[keyword]
            if isinstance(value, list):
                value = " ".join(value)
            yield f"{keyword} {value}"
    yield "/end throw"
   
with open(filename) as in_file, open("output.txt", "w") as out_file:
    entry = []
    for line in in_file:
        line = line.strip()
        if line == "/begin throw":
            entry = []
        elif line == "/end throw":
            throw = parse_lines(entry)
            for out_line in output_throw(throw):
                out_file.write(out_line + "\n")
        else:
            entry.append(line)

Or, if you really need to maximize performance by removing all unnecessary operations, you could read, parse and write in a single pass with one long if/elif chain, like this:

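# KEYWORDS is the list defined in the previous snippet
with open(filename) as in_file, open("output.txt", "w") as out_file:
    entry = []
    in_section = False
    def write(line):
        out_file.write(line + "\n")
    for line in in_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines; split()[0] would fail on them
        first = line.split()[0]
        if line == "/begin throw":
            in_section = False
            write(line)
            entry = []
        elif line == "/end throw":
            in_section = False
            # Flush the buffered lines; STORE_ADDRESS has already been written
            for line_ in entry:
                write(line_)
            write(line)
        elif first == "STORE_ADDRESS":
            in_section = False
            write(line)
        elif line in KEYWORDS:
            # Bare keyword: the following lines are its values
            in_section = True
            entry.append(line)
        elif first in KEYWORDS:
            in_section = False
            entry.append(line)
        elif in_section:
            # Continuation value: join it onto the keyword line it belongs to
            entry[-1] += " " + line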
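
Before running this on the full 900k-line file, it may be worth checking the logic on one small block. The sketch below repackages the same single-pass idea as a generator over any iterable of lines, so it can be fed either a list of strings or an open file. The function name `convert` is a hypothetical helper, and the keyword set is an assumption taken from the sample in the question.

```python
KEYWORDS = ("STORE_ADDRESS", "COLOR", "DESCRIPTION", "DIMENSION")


def convert(lines):
    """Yield reformatted lines (without newlines) for an iterable of input lines."""
    entry = []          # buffered lines of the current throw, minus STORE_ADDRESS
    in_section = False  # True while collecting values for a bare keyword
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        first = line.split()[0]
        if line == "/begin throw":
            in_section = False
            entry = []
            yield line
        elif line == "/end throw":
            in_section = False
            yield from entry
            yield line
        elif first == "STORE_ADDRESS":
            # Emitted right away, so it lands directly after "/begin throw".
            in_section = False
            yield line
        elif first in KEYWORDS:
            # A bare keyword (nothing else on the line) opens a multi-line section.
            in_section = line == first
            entry.append(line)
        elif in_section:
            entry[-1] += " " + line


# Usage with real files (file names are placeholders):
# with open("input.txt") as in_file, open("output.txt", "w") as out_file:
#     out_file.writelines(line + "\n" for line in convert(in_file))
```

Because the generator holds only one entry in memory at a time, it behaves like the loop above for large inputs while staying easy to test.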