Removing elements of dataframe with different number of columns-CodePudding

I have a tsv file that looks like this:

BCM92732.1  sialidase   Abditibacteriota    bacterium
VTR99890.1  sialidase   :   Sialidase   Precursor   OS=Rhodopirellula   baltica (strain SH1)    GN=RB3353   PE=4    SV=1:   BNR_2   Tuwongella  immobilis
QEL17956.1  putative    retaining   sialidase   Limnoglobus roseus
AMV31440.1  Sialidase   precursor   Pirellula   sp. SH-Sr6A

I want to "clean" this file, for example, removing some columns:

BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella   immobilis
QEL17956.1 Limnoglobus  roseus
AMV31440.1 Pirellula    sp. SH-Sr6A

I thought about removing columns, however, the number of columns is different between rows. I´m using bash for that task. Is any better way to do that? For example, python or perl?

CodePudding user response：

This isn't particularly sophisticated, but it works with your example data:

# Create a working dir
mkdir -p taxdump
cd taxdump

# Create your 'example.tsv' file
cat <<EOF > example.tsv
BCM92732.1  sialidase   Abditibacteriota    bacterium
VTR99890.1  sialidase   :   Sialidase   Precursor   OS=Rhodopirellula   baltica (strain SH1)    GN=RB3353   PE=4    SV=1:   BNR_2   Tuwongella  immobilis
QEL17956.1  putative    retaining   sialidase   Limnoglobus roseus
AMV31440.1  Sialidase   precursor   Pirellula   sp. SH-Sr6A
EOF

# Download and unpack the NCBI taxonomy database (~50mb)
curl "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" -o taxdump.tar.gz
tar -zxvf taxdump.tar.gz

# Grab the second column (the species names) and clean up the whitespace
awk '
BEGIN{FS="|"}
$0 ~ "scientific name" {
    gsub("\t", "", $2)
    sub("^ ", "", $2)
    print $2
}' names.dmp > bacteria_names

# Use AWK to cycle through bacteria_names (i.e. every species in the db)
# and check for matches with your tsv file ("example.tsv").
# Start by looking at the last four columns of example.tsv,
# then the last three columns, then the last two columns.

awk '
NR==FNR {
    names[$0]
    next
}
{
    for (species in names)
    {
        four=$(NF-3)" "$(NF-2)" "$(NF-1)" "$NF
        three=$(NF-2)" "$(NF-1)" "$NF
        two=$(NF-1)" "$NF
        if (species == four || species == three || species == two) 
        {
            print $1, species
        }
     }
}' bacteria_names example.tsv

# Results:
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A

CodePudding user response：

rquery can do this without much effort.

$rq -q "p d/  /r| select @1,foreach(%-1,%,$) | f @%>2" samples/test1.csv -m error
BCM92732.1       Abditibacteriota       bacterium
VTR99890.1       Tuwongella     immobilis
QEL17956.1       Limnoglobus    roseus
AMV31440.1       Pirellula       sp. SH-Sr6A

$rq -q "p d/  /r| select @1,@(n-1),@n | f @%>2" samples/test1.csv -m error
BCM92732.1       Abditibacteriota       bacterium
VTR99890.1       Tuwongella     immobilis
QEL17956.1       Limnoglobus    roseus
AMV31440.1       Pirellula       sp. SH-Sr6A

Check out the latest version from here: https://github.com/fuyuncat/rquery/releases