I am writing a script to parse and tab-delimit a text report. The report has 17 columns with no delimiters, so my current solution writes each line to a list using fixed position indexes that I store in variables.
The example code below covers 3 columns; for the full report, the datafile_lines.append() call lists out all 17 columns (and 17 variables). It takes several minutes to process a report with 13,000 lines, so I'm assuming there is a better way of doing this.
datafile_lines = []
col1 = [0, 12]
col2 = [13, 21]
col3 = [22, 25]

with open("RawReport.txt", "r") as datafile:
    for line in datafile:
        datafile_lines.append(line[col1[0]:col1[1]].strip() + "\t" +
                              line[col2[0]:col2[1]].strip() + "\t" +
                              line[col3[0]:col3[1]].strip() + "\n")
CodePudding user response:
I can't explain several minutes for a modest 13K lines to process, but you can get some speedup by joining strings instead of concatenating them with +. In addition, the : operator creates slice objects for indexing. You can use slice directly and make your code easier to write.
columns = [(0, 12), (13, 21), (22, 25)]
col_slices = [slice(*pos) for pos in columns]

with open("RawReport.txt") as datafile:
    datafile_lines = ["\t".join(line[col].strip() for col in col_slices)
                      for line in datafile]
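If the goal is then a tab-delimited file on disk, a minimal sketch of writing the result out could look like this (TabReport.txt is a placeholder name, not from the question):

# Write the joined rows out, one report row per line.
# "TabReport.txt" is a hypothetical output name.
with open("TabReport.txt", "w") as out:
    out.write("\n".join(datafile_lines) + "\n")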
CodePudding user response:
It really would help to have your input text file. I created a text file for testing purposes, which consists of 20000 lines with 30 characters in each line. Even though I find the solution from @tdelaney nice, I timed it as well as your original code, with the following results for 100 runs:
original code: 1.06 s
tdelaney's code: 0.81 s
my code: 0.57 s
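The answer doesn't show its timing harness; a sketch of how such a 100-run measurement could be taken with the standard timeit module (file name and column bounds assumed from the question) is:

import timeit

def parse():
    # Same list comprehension as the answer's code below,
    # run against the 20000-line test file.
    with open("RawReport.txt") as file:
        return ["\t".join([f[0:12], f[13:21], f[22:25]]) + "\n" for f in file]

# Total wall time for 100 runs, matching the figures above.
print(timeit.timeit(parse, number=100))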
So here is my take. Replace file_name with your actual file name:
col1 = [0, 12]
col2 = [13, 21]
col3 = [22, 25]

with open(file_name, "r") as file:
    data = ["\t".join([f[0:12], f[13:21], f[22:25]]) + "\n" for f in file]
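Note that this version omits the .strip() the question used, which is part of why it is fastest. If trailing spaces inside the fixed-width fields matter, a data-driven variant with the strip restored might look like this; the (start, end) pairs beyond the three shown are placeholders to extend to all 17 columns:

# Sketch: generalize to all 17 columns via (start, end) pairs.
# Extend the bounds list with the real report layout.
bounds = [(0, 12), (13, 21), (22, 25)]
with open(file_name, "r") as file:
    data = ["\t".join(f[a:b].strip() for a, b in bounds) + "\n" for f in file]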