Parse text with uncertain number of fields

I have a file (~50,000 lines), text.txt, shown below, which contains gene info from five individuals (AB, BB, CA, DD, GG). The \t in the file is a tab separator. The file also contains a lot of information that is not useful, and I would like to clean it up. What I need is to extract the species name together with its 'transcript=' ID, if it has one, and also extract the 'DD:' and 'GG:' parts.

$ head text.txt

GeneA\tAB:xrbyk | jdnif | otherText\tBB:abdf | jdhfkc | otherDifferentText\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneB\tBB:dfsfg | dfsfvdf | otherDifferent\tCA:zdcfsdf | xfgdfgs | transcript=sdfs.1\tDD:sdfsw.1 type=cds\tGG:fghfg.1 type=cds
GeneC\tAB:dsfsdf | xdvv | otherText1\tBB:xdsd | sdfsdf | otherDifferentText2\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneD\tAB:dfsdf | Asda | transcript=asdasd.2\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds

and I would like the output to be

GeneA\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1
GeneB\tCA:transcript=sdfs.1\tDD:sdfsw.1\tGG:fghfg.1
GeneC\tDD:hudnf.1\tGG:jdubf.1
GeneD\tAB:transcript=asdasd.2\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1

I have been searching and thinking for a whole afternoon already, and my only idea is to split this file apart by species, keeping the first column (the gene name), then modify each file separately, and finally merge the files back together by gene name. But as you can see, the lines do not necessarily have the same number of fields, so I can't simply use awk to print a certain column. I'm not sure how I can split them up by species.

I tried to mimic the approach in "How to use sed/grep to extract text between two words?", but without success. I also read a bit about how to split text in Python (as I'm starting to learn this language), but still can't figure it out. Could anyone please help? Thanks a lot!

UPDATE TO CLARIFY THE INPUT DATA: In the example shown above, the gene info of each individual is separated by a tab (\t), which means that all the text after the individual name plus colon (e.g. AB:) belongs to that individual (e.g. "xrbyk | jdnif | otherText" for AB). Whether to keep an individual in the final output depends on whether it has "transcript=" information, except for DD and GG, which are always kept. This is why, in the final output, the 1st line starts with CA rather than AB.

CodePudding user response：

Assuming those \t in your sample text are real tabs, this Perl one-liner will do it. If they're literal \t text, then it needs to be tweaked a tad. To grab more fields, add each one to the regex alternation after GG:.

perl -lne '@wanted = $_ =~ m{(^Gene[ABCD]|(?:transcript=|DD:|GG:)\S+)}g; print join "\t", @wanted if @wanted;' inputfile.txt > outputfile.txt

Output:

GeneA   transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1
GeneB   transcript=sdfs.1       DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   transcript=asdasd.2     transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1
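
For readers more comfortable with Python, the same regex idea can be sketched with re.findall. This is a rough equivalent rather than the answerer's code: the names WANTED and extract_wanted are made up for illustration, and, like the Perl output above, it drops the species prefix (e.g. CA:) in front of transcript=.

```python
import re

# Same alternation as the Perl one-liner: the gene name at the start of the
# line, or any run of non-space characters starting with transcript=, DD: or GG:.
WANTED = re.compile(r"^Gene[ABCD]|(?:transcript=|DD:|GG:)\S+")


def extract_wanted(line):
    """Return the matched pieces of one line, joined by tabs (hypothetical helper)."""
    return "\t".join(WANTED.findall(line))


# First sample line from the question, with real tab characters
sample = (
    "GeneA\tAB:xrbyk | jdnif | otherText"
    "\tBB:abdf | jdhfkc | otherDifferentText"
    "\tCA:bdmf | nfjvks | transcript=aaabb.1"
    "\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds"
)
print(extract_wanted(sample))
```

Because the pattern has no capturing groups, re.findall returns the full text of each match, which mirrors what the Perl list-context match with /g collects.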

CodePudding user response：

This solution is a bit long, but should be easy to work with:

#!/usr/bin/env python3

# main.py

import csv
import fileinput
import re


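def filter_fields(row):
    """Keep the gene name, the DD:/GG: IDs, and any field with transcript=."""
    output = []
    for field_number, field in enumerate(row, 1):
        if field_number == 1:
            # First column is the gene name; always keep it
            output.append(field)
        elif "DD:" in field or "GG:" in field:
            # Keep only the first token, e.g. "DD:hudnf.1", dropping "type=cds"
            output.append(field.split()[0])
        elif "transcript=" in field:
            # Remove stuff from after the colon to the last space
            output.append(re.sub(r":.* ", ":", field))

    return "\t".join(output)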


# Read from the files named on the command line, or from stdin if none are given
reader = csv.reader(fileinput.input(), delimiter="\t")
for row in reader:
    print(filter_fields(row))

How to run it:

# Output to the screen
python3 main.py text.txt

# Output to a file
python3 main.py text.txt > out.txt

# Use as a filter
cat text.txt | python3 main.py

Notes

In this solution, each line of text is broken into a row of fields.
The function filter_fields takes each row, decides which fields to keep, and reformats them. It then returns those fields, tab-separated.
The re.sub(...) call says: Delete everything after the colon, up to the last space.
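
One caveat with the greedy re.sub pattern: it relies on transcript= being the last space-separated token in its field, which holds for the sample data. If that were not the case (an assumption not shown in the question), the transcript ID itself would be deleted. The sketch below is a possible stricter variant, not part of the original answer: filter_fields_strict and TRANSCRIPT are hypothetical names, and it pulls out the transcript=... token explicitly and rebuilds the species prefix.

```python
import re

# Matches the transcript=... token wherever it sits inside a field
TRANSCRIPT = re.compile(r"transcript=\S+")


def filter_fields_strict(row):
    """Hypothetical variant of filter_fields that extracts tokens explicitly."""
    if not row:
        return ""  # csv.reader yields an empty list for a blank line
    output = [row[0]]  # the gene name is always kept
    for field in row[1:]:
        # Species is the text before the first colon, e.g. "CA"
        species, _, rest = field.partition(":")
        if species in ("DD", "GG"):
            # Keep only the ID, dropping trailing text such as "type=cds"
            output.append(field.split()[0])
        else:
            match = TRANSCRIPT.search(rest)
            if match:
                output.append(f"{species}:{match.group(0)}")
    return "\t".join(output)


# Last sample line from the question, split on real tab characters
row = (
    "GeneD\tAB:dfsdf | Asda | transcript=asdasd.2"
    "\tCA:bdmf | nfjvks | transcript=aaabb.1"
    "\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds"
).split("\t")
print(filter_fields_strict(row))

# The original pattern on a sample field, for comparison
print(re.sub(r":.* ", ":", "CA:bdmf | nfjvks | transcript=aaabb.1"))
```

To try it in main.py, the loop would call filter_fields_strict(row) in place of filter_fields(row); on the four sample lines both versions should give the same result.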