I have a tsv file that looks like this:
BCM92732.1 sialidase Abditibacteriota bacterium
VTR99890.1 sialidase : Sialidase Precursor OS=Rhodopirellula baltica (strain SH1) GN=RB3353 PE=4 SV=1: BNR_2 Tuwongella immobilis
QEL17956.1 putative retaining sialidase Limnoglobus roseus
AMV31440.1 Sialidase precursor Pirellula sp. SH-Sr6A
I want to "clean" this file, for example, removing some columns:
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
I thought about removing columns, however, the number of columns is different between rows. I´m using bash for that task. Is any better way to do that? For example, python or perl?
CodePudding user response:
This isn't particularly sophisticated, but it works with your example data:
# Create a working dir
mkdir -p taxdump
cd taxdump
# Create your 'example.tsv' file
cat <<EOF > example.tsv
BCM92732.1 sialidase Abditibacteriota bacterium
VTR99890.1 sialidase : Sialidase Precursor OS=Rhodopirellula baltica (strain SH1) GN=RB3353 PE=4 SV=1: BNR_2 Tuwongella immobilis
QEL17956.1 putative retaining sialidase Limnoglobus roseus
AMV31440.1 Sialidase precursor Pirellula sp. SH-Sr6A
EOF
# Download and unpack the NCBI taxonomy database (~50mb)
curl "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" -o taxdump.tar.gz
tar -zxvf taxdump.tar.gz
# Grab the second column (the species names) and clean up the whitespace
awk '
BEGIN{FS="|"}
$0 ~ "scientific name" {
gsub("\t", "", $2)
sub("^ ", "", $2)
print $2
}' names.dmp > bacteria_names
# Use AWK to cycle through bacteria_names (i.e. every species in the db)
# and check for matches with your tsv file ("example.tsv").
# Start by looking at the last four columns of example.tsv,
# then the last three columns, then the last two columns.
awk '
NR==FNR {
names[$0]
next
}
{
for (species in names)
{
four=$(NF-3)" "$(NF-2)" "$(NF-1)" "$NF
three=$(NF-2)" "$(NF-1)" "$NF
two=$(NF-1)" "$NF
if (species == four || species == three || species == two)
{
print $1, species
}
}
}' bacteria_names example.tsv
# Results:
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
CodePudding user response:
rquery can do this without much effort.
$rq -q "p d/ /r| select @1,foreach(%-1,%,$) | f @%>2" samples/test1.csv -m error
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
or
$rq -q "p d/ /r| select @1,@(n-1),@n | f @%>2" samples/test1.csv -m error
BCM92732.1 Abditibacteriota bacterium
VTR99890.1 Tuwongella immobilis
QEL17956.1 Limnoglobus roseus
AMV31440.1 Pirellula sp. SH-Sr6A
Check out the latest version from here: https://github.com/fuyuncat/rquery/releases