Tab-Delimiting a Non-Delimited Text Report

Time:10-04

I am writing a script to parse out and tab-delimit a text report. The report has 17 columns with no delimiters, so my current solution is to write the lines to a list using specific position indexes which I am calling from variables.

Example code below is for 3 columns, but for the full report the datafile_lines.append() line lists out all 17 columns (and 17 variables). It takes several minutes to process a report with 13,000 lines. I'm assuming there is a better way of doing this.

datafile_lines = []
col1 = [0,12]
col2 = [13,21]
col3 = [22,25]

with open("RawReport.txt","r") as datafile:
    for line in datafile:
        datafile_lines.append(line[col1[0]: col1[1]].strip() + "\t" +
                              line[col2[0]: col2[1]].strip() + "\t" +
                              line[col3[0]: col3[1]].strip() + "\n")

CodePudding user response:

I can't explain why a modest 13K lines would take several minutes to process, but you can get some speedup by joining strings instead of concatenating them with +. In addition, the : operator creates slice objects for indexing. You can use slice directly and make your code easier to write.

columns = [(0,12), (13,21), (22,25)]
col_slices = [slice(*pos) for pos in columns]

with open("RawReport.txt") as datafile:
    datafile_lines = ["\t".join(line[col].strip() for col in col_slices)  
            for line in datafile]
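
To make the slice idea concrete, here is a minimal sketch that runs the same technique on a couple of in-memory lines instead of a file. The helper name tab_delimit and the sample rows are made up for illustration; only the column positions come from the question.

```python
# Column layout from the question: (start, end) positions, end exclusive.
COLUMNS = [(0, 12), (13, 21), (22, 25)]

def tab_delimit(lines, columns=COLUMNS):
    """Return each fixed-width line as a tab-delimited string (no newline)."""
    # Build the slice objects once, outside the per-line loop.
    col_slices = [slice(start, end) for start, end in columns]
    return ["\t".join(line[s].strip() for s in col_slices) for line in lines]

# Hypothetical sample rows, padded so the fields start at 0, 13 and 22.
sample = [
    f"{'ACME Corp':<13}{'2023-10':<9}{'USD'}\n",
    f"{'Widget Ltd':<13}{'2023-11':<9}{'EUR'}\n",
]

rows = tab_delimit(sample)
for row in rows:
    print(row)
```

Note that the joined rows carry no trailing newline, so one has to be added back when writing them to an output file.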

CodePudding user response:

It really would help to have your input text file. I created a text file for testing purposes, consisting of 20,000 lines of 30 characters each. Although I find the solution from @tdelaney nice, I timed it along with your original code and mine, with the following results for 100 runs:

original code:   1.06 s
tdelaney's code: 0.81 s
my code:         0.57 s

So here is my take. Replace file_name with your actual file:

col1 = [0, 12]
col2 = [13, 21]
col3 = [22, 25]

with open(file_name, "r") as file:
    data = ["\t".join([f[0:12], f[13:21], f[22:25]]) + "\n" for f in file]

Here is a link to the test text file I used
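
For anyone who wants to repeat the comparison without the test file, the following is a rough sketch using timeit on synthetic data. The generated lines (20,000 lines of 30 characters) and the function names are assumptions made for this example, not the original test setup, so the absolute numbers will differ from those above.

```python
import timeit

# Synthetic stand-in for the report: 20,000 lines, 30 characters each (assumed layout).
LINES = [
    f"{'name' + str(i):<13}{'val' + str(i % 100):<9}{i % 1000:<8}\n"
    for i in range(20000)
]

def concat(lines):
    """Original approach: slice, strip, and concatenate with +."""
    out = []
    for line in lines:
        out.append(line[0:12].strip() + "\t" +
                   line[13:21].strip() + "\t" +
                   line[22:25].strip() + "\n")
    return out

def join_slices(lines, col_slices=(slice(0, 12), slice(13, 21), slice(22, 25))):
    """First answer: prebuilt slice objects plus str.join (no trailing newline)."""
    return ["\t".join(line[s].strip() for s in col_slices) for line in lines]

def join_literal(lines):
    """Second answer: literal slices plus str.join, without strip()."""
    return ["\t".join([f[0:12], f[13:21], f[22:25]]) + "\n" for f in lines]

for fn in (concat, join_slices, join_literal):
    seconds = timeit.timeit(lambda: fn(LINES), number=10)
    print(f"{fn.__name__:<14}{seconds:.3f} s for 10 runs")
```

One thing to keep in mind when reading the timings: the last variant skips strip(), so its fields keep their padding, and part of its speed advantage likely comes from doing less work per line.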
