I am analyzing multi-column data organized in the following format:
#Acceptor DonorH Donor Frames Frac AvgDist AvgAng
lig_608@O2 GLU_166@H GLU_166@N 708 0.7548 2.8489 160.3990
lig_608@O3 THR_26@H THR_26@N 532 0.5672 2.8699 161.9043
THR_26@O lig_608@H15 lig_608@N6 414 0.4414 2.8509 153.3394
lig_608@N2 HIE_163@HE2 HIE_163@NE2 199 0.2122 2.9167 156.3248
GLN_189@OE1 lig_608@H2 lig_608@N4 32 0.0341 2.8899 156.4308
THR_25@OG1 lig_608@H14 lig_608@N5 26 0.0277 2.8906 160.9933
lig_608@O4 GLY_143@H GLY_143@N 25 0.0267 2.8647 146.5977
lig_608@O3 THR_25@HG1 THR_25@OG1 16 0.0171 2.7618 152.3421
lig_608@O2 GLN_189@HE21 GLN_189@NE2 15 0.0160 2.8947 154.3567
lig_608@N7 ASN_142@HD22 ASN_142@ND2 10 0.0107 2.9196 147.8856
lig_608@O4 ASN_142@HD21 ASN_142@ND2 9 0.0096 2.8462 148.4038
HIE_41@O lig_608@H14 lig_608@N5 9 0.0096 2.8693 148.4560
GLN_189@NE2 lig_608@H2 lig_608@N4 7 0.0075 2.9562 153.6447
lig_608@O4 ASN_142@HD22 ASN_142@ND2 4 0.0043 2.8954 158.0293
THR_26@O lig_608@H14 lig_608@N5 2 0.0021 2.8259 156.4279
lig_608@O4 ASN_119@HD21 ASN_119@ND2 1 0.0011 2.8786 144.1573
lig_608@N2 GLU_166@H GLU_166@N 1 0.0011 2.9295 149.3281
My gnuplot script, embedded in a Bash script, filters the data and keeps only two columns, subject to two conditions: 1) the label is taken from the 1st or the 3rd column, whichever does not start with "lig"; 2) only rows whose value in the 5th column is > 0.05 are kept.
#!/bin/bash
output=$(pwd)
# common prefix of the files to process
target='HBavg'
# loop each file and create a bar graph
for file in "${output}"/${target}*.log ; do
file_name3=$(basename "$file")
file_name2="${file_name3/.log/}"
file_name="${file_name2/${target}_/}"
echo "visualization with Gnuplot!"
cat <<EOS | gnuplot > ${output}/${file_name2}.png
set term pngcairo size 800,600
### conditional xtic labels
reset session
set termoption noenhanced
set title "$file_name" font "Century,22" textcolor "#b8860b"
set tics font "Helvetica,10"
FILE = "$file"
set xlabel "H-bond donor, residue"
set ylabel "Fraction, %"
set yrange [0:1]
set key off
set style fill solid 0.5
set boxwidth 0.9
set grid y
#set xrange[-1:5]
set table \$Filtered
myTic(col1,col2) = strcol(col1)[1:3] eq 'lig' ? strcol(col2) : strcol(col1)
plot FILE u ((y0=column(5))>0.05 ? sprintf("%g %s",y0,myTic(1,3)) : '') w table
unset table
plot \$Filtered u 0:1:xtic(2) w boxes, '' u 0:1:1 w labels offset 0,1
### end of script
EOS
done
In the end it writes the filtered data into a new table and produces a bar plot.
In that plot the bars keep the order of their Y values (the values from the 5th column of the initial data). How could I instead sort the bars alphabetically by the labels displayed on X, thereby changing the order of the bars on the graph?
Since the original data is always sorted by the 5th column (Frac), would it be possible to re-sort it on the fly while feeding it to gnuplot?
One idea is to pipe it directly into the gnuplot script through awk and sort, e.g.:
plot "<awk -v OFS='\t' 'NR > 1 && \$5 > 0.05' $file | sort -k1,1" using 0:5:xtic(3) with boxes
How could I do the same with my script, where the data is filtered by gnuplot itself and I only need to sort the bars produced by:
plot \$Filtered u 0:1:xtic(2) w boxes, '' u 0:1:1 w labels offset 0,1
CodePudding user response:
edit: added color alternation
I would stick to external tools for processing the data and then call gnuplot:
#!/bin/bash
{
echo '$data << EOD'
awk 'NR > 1 && $5 > 0.05 {print ($1 ~ /^lig/ ? $2 : $1 ), $5}' file.log |
sort -t ' ' -k1,1 |
awk -v colors='0x4472c4 0xed7d31' '
BEGIN { nc = split(colors,clrArr) }
{ print $0, clrArr[NR % nc + 1] }
'
echo 'EOD'
cat << 'EOF'
set term pngcairo size 800,600
set title "file.log" font "Century,22" textcolor "#b8860b"
set xtics noenhanced font "Helvetica,10"
set xlabel "H-bond donor, residue"
set ylabel "Fraction, %"
set yrange [0:1]
set key off
set boxwidth 0.9
set style fill solid 1.0
plot $data using 0:2:3:xtic(1) with boxes lc rgb var, \
'' using 0:2:2 with labels offset 0,1
EOF
} | gnuplot > file.png
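To check what ends up in the data block before involving gnuplot, the preprocessing stage can be run on its own. The following is only a sketch: the file name sample_HBavg.log and the function name prepare_data are made up for illustration, and the rows are a subset of the data shown in the question.

```shell
#!/bin/bash
# Hypothetical sample file holding a few rows from the question's data.
cat > sample_HBavg.log << 'EOF'
#Acceptor DonorH Donor Frames Frac AvgDist AvgAng
lig_608@O2 GLU_166@H GLU_166@N 708 0.7548 2.8489 160.3990
lig_608@O3 THR_26@H THR_26@N 532 0.5672 2.8699 161.9043
THR_26@O lig_608@H15 lig_608@N6 414 0.4414 2.8509 153.3394
lig_608@N2 HIE_163@HE2 HIE_163@NE2 199 0.2122 2.9167 156.3248
GLN_189@OE1 lig_608@H2 lig_608@N4 32 0.0341 2.8899 156.4308
EOF

# prepare_data (hypothetical name): filter, sort by label, then append a color.
prepare_data() {
    # Skip the header, keep Frac > 0.05, print the non-ligand label and Frac.
    awk 'NR > 1 && $5 > 0.05 {print ($1 ~ /^lig/ ? $2 : $1), $5}' "$1" |
    # Alphabetical order on the label column.
    sort -t ' ' -k1,1 |
    # Alternate the two colors after sorting, so neighbouring bars differ.
    awk -v colors='0x4472c4 0xed7d31' '
        BEGIN { nc = split(colors, clrArr) }
        { print $0, clrArr[NR % nc + 1] }
    '
}

prepare_data sample_HBavg.log
# GLU_166@H 0.7548 0xed7d31
# HIE_163@HE2 0.2122 0x4472c4
# THR_26@H 0.5672 0xed7d31
# THR_26@O 0.4414 0x4472c4
```

Note that this labels the bars with the donor hydrogen ($2). Replacing $2 with $3 in the first awk should give the donor heavy atom instead, which is what myTic(1,3) selects in the question's script.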
Remarks:
- The problem with printing the values on top of the bars in gnuplot is that you can't do it directly from a stream; you need a file or a variable. Here I saved the input data into the $data variable.
- You'll be able to expand shell variables in the heredoc if you unquote it (<< 'EOF' => << EOF), but you have to make sure that you escape the $ of $data.
- The simplest way to add colors is to add a "color" field in the output of awk, but the sorting would mess it up; that's why I add the color in a second awk after the sort.
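To connect this with the loop from the question, here is a rough sketch of the unquoted-heredoc variant, where the shell expands the title while \$data is passed through to gnuplot untouched. The helper name emit_gnuplot_script is an assumption made for illustration, the color column is left out to keep it short, and the loop assumes the HBavg*.log files sit in the current directory.

```shell
#!/bin/bash
# emit_gnuplot_script (hypothetical helper): print a complete gnuplot script,
# data block first, then the plotting commands, for a single log file.
emit_gnuplot_script() {
    local file=$1
    local title
    title=$(basename "$file" .log)

    # Inline data block: filtered with awk, sorted alphabetically by label.
    echo '$data << EOD'
    awk 'NR > 1 && $5 > 0.05 {print ($1 ~ /^lig/ ? $2 : $1), $5}' "$file" |
        sort -t ' ' -k1,1
    echo 'EOD'

    # Unquoted heredoc: $title is expanded by the shell,
    # while the escaped \$data reaches gnuplot as $data.
    cat << EOF
set term pngcairo size 800,600
set title "$title" noenhanced
set xtics noenhanced
set xlabel "H-bond donor, residue"
set ylabel "Fraction, %"
set yrange [0:1]
set key off
set boxwidth 0.9
set style fill solid 0.5
plot \$data using 0:2:xtic(1) with boxes, '' using 0:2:2 with labels offset 0,1
EOF
}

# One PNG per matching file, as in the question's loop.
for file in ./HBavg*.log; do
    [ -e "$file" ] || continue    # the glob matched nothing
    emit_gnuplot_script "$file" | gnuplot > "${file%.log}.png"
done
```

Keeping the script generation in a function makes it easy to inspect the generated commands (emit_gnuplot_script some_file.log | less) before piping them to gnuplot.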