Home > OS >  Removing elements of dataframe with different number of columns
Removing elements of dataframe with different number of columns

Time:08-25

I have a tsv file that looks like this:

BCM92732.1  sialidase   Abditibacteriota    bacterium
VTR99890.1  sialidase   :   Sialidase   Precursor   OS=Rhodopirellula   baltica (strain SH1)    GN=RB3353   PE=4    SV=1:   BNR_2   Tuwongella  immobilis
QEL17956.1  putative    retaining   sialidase   Limnoglobus roseus
AMV31440.1  Sialidase   precursor   Pirellula   sp. SH-Sr6A

I want to "clean" this file, for example, removing some columns:

BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella   immobilis
QEL17956.1 Limnoglobus  roseus
AMV31440.1 Pirellula    sp. SH-Sr6A

I thought about removing columns, however, the number of columns is different between rows. I´m using bash for that task. Is any better way to do that? For example, python or perl?

CodePudding user response:

This isn't particularly sophisticated, but it works with your example data:

# Create a working dir
mkdir -p taxdump
cd taxdump

# Create your 'example.tsv' file
cat <<EOF > example.tsv
BCM92732.1  sialidase   Abditibacteriota    bacterium
VTR99890.1  sialidase   :   Sialidase   Precursor   OS=Rhodopirellula   baltica (strain SH1)    GN=RB3353   PE=4    SV=1:   BNR_2   Tuwongella  immobilis
QEL17956.1  putative    retaining   sialidase   Limnoglobus roseus
AMV31440.1  Sialidase   precursor   Pirellula   sp. SH-Sr6A
EOF

# Download and unpack the NCBI taxonomy database (~50mb)
curl "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" -o taxdump.tar.gz
tar -zxvf taxdump.tar.gz

# Grab the second column (the species names) and clean up the whitespace
awk '
BEGIN{FS="|"}
$0 ~ "scientific name" {
    gsub("\t", "", $2)
    sub("^ ", "", $2)
    print $2
}' names.dmp > bacteria_names

# Use AWK to cycle through bacteria_names (i.e. every species in the db)
# and check for matches with your tsv file ("example.tsv").
# Start by looking at the last four columns of example.tsv,
# then the last three columns, then the last two columns.

awk '
NR==FNR {
    names[$0]
    next
}
{
    for (species in names)
    {
        four=$(NF-3)" "$(NF-2)" "$(NF-1)" "$NF
        three=$(NF-2)" "$(NF-1)" "$NF
        two=$(NF-1)" "$NF
        if (species == four || species == three || species == two) 
        {
            print $1, species
        }
     }
}' bacteria_names example.tsv

# Results:
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A

CodePudding user response:

rquery can do this without much effort.

$rq -q "p d/  /r| select @1,foreach(%-1,%,$) | f @%>2" samples/test1.csv -m error
BCM92732.1       Abditibacteriota       bacterium
VTR99890.1       Tuwongella     immobilis
QEL17956.1       Limnoglobus    roseus
AMV31440.1       Pirellula       sp. SH-Sr6A

or

$rq -q "p d/  /r| select @1,@(n-1),@n | f @%>2" samples/test1.csv -m error
BCM92732.1       Abditibacteriota       bacterium
VTR99890.1       Tuwongella     immobilis
QEL17956.1       Limnoglobus    roseus
AMV31440.1       Pirellula       sp. SH-Sr6A

Check out the latest version from here: https://github.com/fuyuncat/rquery/releases

  • Related