I have a very large file of about 900k values. It consists of repeated blocks like this:
/begin throw
COLOR red
DESCRIPTION
"cashmere sofa throw"
10
10
156876
DIMENSION
140
200
STORE_ADDRESS 59110
/end throw
The values change from block to block, but I need each block in the format below:
/begin throw
STORE_ADDRESS 59110
COLOR red
DESCRIPTION "cashmere sofa throw" 10 10 156876
DIMENSION 140 200
/end throw
Currently, my approach is to remove the newlines and replace them with spaces. The store address is constant throughout the file, so I thought of removing it from its current index and inserting it before the description:
text_file = open(filename, 'r')
filedata = text_file.readlines()
for num, line in enumerate(filedata, 0):
    if '/begin' in line:
        for index in range(num, len(filedata)):
            if "store_address 59110 " in filedata[index]:
                filedata.remove(filedata[index])
                filedata.insert(filedata[index-7])
                break
            if "DESCRIPTION" in filedata[index]:
                try:
                    filedata[index] = filedata[index].replace("\n", " ")
                    filedata[index + 1] = filedata[index + 1].replace(" ","").replace("\n", " ")
                    filedata[index + 2] = filedata[index + 2].replace(" ","").replace("\n", " ")
                    filedata[index + 3] = filedata[index + 3].replace(" ","").replace("\n", " ")
                    filedata[index + 4] = filedata[index + 4].replace(" ","").replace("\n", " ")
                    filedata[index + 5] = filedata[index + 5].replace(" ","").replace("\n", " ")
                    filedata[index + 6] = filedata[index + 6].replace(" ","").replace("\n", " ")
                    filedata[index + 7] = filedata[index + 7].replace(" ","").replace("\n", " ")
                    filedata[index + 8] = filedata[index + 8].replace(" ","")
                except IndexError:
                    print("Error Index DESCRIPTION:", index, num)
            if "DIMENSION" in filedata[index]:
                try:
                    filedata[index] = filedata[index].replace("\n", " ")
                    filedata[index + 1] = filedata[index + 1].replace(" ","").replace("\n", " ")
                    filedata[index + 2] = filedata[index + 2].replace(" ","").replace("\n", " ")
                    filedata[index + 3] = filedata[index + 3].replace(" ","")
                except IndexError:
                    print("Error Index DIMENSION:", index, num)
After that, I write filedata into another file.
This approach takes too long to run (almost an hour and a half) because, as mentioned earlier, the file is large. Is there a faster approach to this problem?
CodePudding user response:
You can read the file structure by structure so that you don't have to store the whole content in memory and manipulate it there. By structure, I mean all the lines between and including /begin throw and /end throw. This should be much faster.
def rearrange_structure_and_write_into_file(structure, output_file):
    # TODO: rearrange the elements in structure and write the result into output_file
    pass

current_structure = ""
with open(filename, 'r') as original_file:
    with open(output_filename, 'w') as output_file:
        for line in original_file:
            # Accumulate lines until the end of the current structure
            current_structure += line
            if "/end throw" in line:
                rearrange_structure_and_write_into_file(current_structure, output_file)
                current_structure = ""
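One possible way to fill in the TODO is sketched below. It is only an illustration under some assumptions: the field order in OUTPUT_ORDER is taken from the desired output in the question, the helper name rearrange_structure is made up for this example, and any line that does not start with a known keyword is treated as a continuation of the most recent keyword.

```python
# Assumed output order, taken from the desired format in the question.
OUTPUT_ORDER = ["STORE_ADDRESS", "COLOR", "DESCRIPTION", "DIMENSION"]


def rearrange_structure(structure):
    """Return one /begin ... /end block rewritten in the new format.

    Hypothetical helper: `structure` is the text of a single block,
    including its /begin and /end lines.
    """
    lines = [ln.strip() for ln in structure.splitlines() if ln.strip()]
    begin, body, end = lines[0], lines[1:-1], lines[-1]

    fields = {}     # keyword -> list of value tokens
    current = None  # value list of the most recent keyword
    for ln in body:
        keyword, _, rest = ln.partition(" ")
        if keyword in OUTPUT_ORDER:
            current = fields[keyword] = [rest] if rest else []
        elif current is not None:
            # Continuation line: it belongs to the most recent keyword.
            current.append(ln)

    out = [begin]
    for keyword in OUTPUT_ORDER:
        if keyword in fields:
            out.append(" ".join([keyword] + fields[keyword]))
    out.append(end)
    return "\n".join(out) + "\n"


def rearrange_structure_and_write_into_file(structure, output_file):
    output_file.write(rearrange_structure(structure))
```

Keeping the string transformation separate from the file write makes it easy to try the rearrangement on a single block before running it over the whole file.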
CodePudding user response:
Inserting into and removing from a long list is likely to make this code slower than it needs to be. It also makes the code error-prone and difficult to reason about. If any entry has no store_address, the code will not work correctly: it will keep searching through the remaining entries until it finds a store address.
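To illustrate the second point, here is a small sketch with two made-up helper functions (neither is from the question). The first scans to the end of the list, which is what an inner loop over range(num, len(filedata)) effectively does; the second stops at the entry's own /end marker.

```python
def find_unbounded(lines, start, needle="STORE_ADDRESS"):
    """Scan from `start` to the end of the list, ignoring entry boundaries."""
    for index in range(start, len(lines)):
        if needle in lines[index]:
            return index
    return None


def find_in_entry(lines, start, needle="STORE_ADDRESS"):
    """Scan from `start`, but give up at the current entry's /end line."""
    for index in range(start, len(lines)):
        if needle in lines[index]:
            return index
        if lines[index].startswith("/end"):
            return None
    return None


lines = [
    "/begin throw", "COLOR red", "/end throw",            # entry without a store address
    "/begin throw", "STORE_ADDRESS 59110", "/end throw",
]

print(find_unbounded(lines, 0))  # 4: a line that belongs to the next entry
print(find_in_entry(lines, 0))   # None: the first entry has no store address
```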
A better approach would be to break down the code into functions that parse each entry and output it:
KEYWORDS = ["STORE_ADDRESS", "COLOR", "DESCRIPTION", "DIMENSION"]

def parse_lines(lines):
    """ Parse throw data from lines in the old format """
    current_section = None
    r = {}
    for line in lines:
        words = line.strip().split(" ")
        if words[0] in KEYWORDS:
            if words[1:]:
                r[words[0]] = words[1]
            else:
                current_section = r[words[0]] = []
        else:
            current_section.append(line.strip())
    return r

def output_throw(throw):
    """ Output a throw entry as lines of text in the new format """
    yield "/begin throw"
    for keyword in KEYWORDS:
        if keyword in throw:
            value = throw[keyword]
            if type(value) is list:
                value = " ".join(value)
            yield f"{keyword} {value}"
    yield "/end throw"

with open(filename) as in_file, open("output.txt", "w") as out_file:
    entry = []
    for line in in_file:
        line = line.strip()
        if line == "/begin throw":
            entry = []
        elif line == "/end throw":
            throw = parse_lines(entry)
            for out_line in output_throw(throw):
                out_file.write(out_line + "\n")
        else:
            entry.append(line)
Or, if you really need to maximize performance by removing all unnecessary operations, you could read, parse, and write in a single loop with one long conditional, like this:
with open(filename) as in_file, open("output.txt", "w") as out_file:
    entry = []
    in_section = False  # True while reading the values of a bare keyword line

    def write(line):
        out_file.write(line + "\n")

    for line in in_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines so split()[0] cannot fail
        first = line.split()[0]
        if line == "/begin throw":
            in_section = False
            write(line)
            entry = []
        elif line == "/end throw":
            in_section = False
            for line_ in entry:
                write(line_)
            write(line)
        elif first == "STORE_ADDRESS":
            # Written immediately, so it lands right after /begin throw
            in_section = False
            write(line)
        elif line in KEYWORDS:
            in_section = True
            entry.append(line)
        elif first in KEYWORDS:
            in_section = False
            entry.append(line)
        elif in_section:
            # Value line: append it to the keyword line it belongs to
            entry[-1] += " " + line