Home > front end >  Extracting data by tabs in a TSV with Python
Extracting data by tabs in a TSV with Python

Time:06-20

I have a TSV file with the data separated like this

class  (this is tab header)
256-56-8411  Bob      3.61  junior     cs
471-44-7458  Tim      3.45  senior     ce
326-56-4286  Rajesh   2.97  freshman   te
548-66-1124  Eric     2.88  sophomore  ee
447-21-3599  John     2.51  junior     cs
911-41-1256  Rebecca  3.92  senior     cs
854-22-6372  Robin    2.45  freshman   te

criteria (this is tab header)
A   B   C   D   D   1
A   B   C   D   E   2
S   D   F   G   H   2
S   D   F   G   B   4

class  (this is tab header)
447-21-3599  John     2.51  junior     cs
911-41-1256  Rebecca  3.92  senior     cs
854-22-6372  Robin    2.45  freshman   te
256-56-8411  Bob      3.61  junior     cs
471-44-7458  Tim      3.45  senior     ce
326-56-4286  Rajesh   2.97  freshman   te
548-66-1124  Eric     2.88  sophomore  ee



criteria (this is tab header)
S   D   F   G   H   2
S   D   F   G   B   4
A   B   C   D   D   1
A   B   C   D   E   2

I need help writing a python script that will extract and merge all of the data for each tab into One file for respective tabs . One for class , another for tab criteria. The out dataset will look something like this.

output file for class
256-56-8411  Bob      3.61  junior     cs
471-44-7458  Tim      3.45  senior     ce
326-56-4286  Rajesh   2.97  freshman   te
548-66-1124  Eric     2.88  sophomore  ee
447-21-3599  John     2.51  junior     cs
911-41-1256  Rebecca  3.92  senior     cs
854-22-6372  Robin    2.45  freshman   te
447-21-3599  John     2.51  junior     cs
911-41-1256  Rebecca  3.92  senior     cs
854-22-6372  Robin    2.45  freshman   te
256-56-8411  Bob      3.61  junior     cs
471-44-7458  Tim      3.45  senior     ce
326-56-4286  Rajesh   2.97  freshman   te
548-66-1124  Eric     2.88  sophomore  ee


output file for criteria
A   B   C   D   D   1
A   B   C   D   E   2
S   D   F   G   H   2
S   D   F   G   B   4
S   D   F   G   H   2
S   D   F   G   B   4
A   B   C   D   D   1
A   B   C   D   E   2

Thanks for your anticipated help.

CodePudding user response:

Given the file you gave this will sort your TSV into the two lists.

It will fail if there is any not empty line above one of the * (this is tab header) lines.

NB1. as I wrote it right now it will replace the original file with the new sorted file.

NB2. this reads the file in memory, if you're working with huge files it might be best to read/write to a pair of tempfile.TemporaryFile and then concatenate those.

lines = {"class": [], "criteria": []}

with open("file.tsv", "r") as f:
    while line := f.readline():
        if line == "\n":
            continue
        elif line == "class  (this is tab header)\n":
            current = "class"
        elif line == "criteria (this is tab header)\n":
            current = "criteria"
        else:
            lines[current].append(line)

with open("file.tsv", "w") as f:  # note this truncates the file to 0
    f.write("class  (this is tab header)\n")
    f.writelines(lines["class"])
    f.write("criteria (this is tab header)\n")
    f.writelines(lines["criteria"])

CodePudding user response:

You could use a dictionary to hold suitable file handles and write the lines as they are read in. This would then work on any file size:

output = {
    'class' : open('class.tsv', 'w'),
    'criteria' : open('criteria.tsv', 'w')
}

# Write suitable headers
for header, handle in output.items():
    handle.write(f"{header} (this is a tab header)\n")  

with open('input.tsv') as f_input:
    for line in f_input:
        if line.strip():   # skip any blank lines
            if line.startswith('class'):
                mode = 'class'
            elif line.startswith('criteria'):
                mode = 'criteria'
            else:
                output[mode].write(line)
                
for handle in output.values():
    handle.close()
  • Related