I have a tab separated file that looks like this:
4S2P_1:A 4S2P_1:A
4S2P_1:A 6PXX_1:A
4S2P_1:A 6HB8_1:A
4S2P_1:A 6HOO_1:A
4S2P_1:A 6I5D_1:A
4S2R_1:A 4S2R_1:A
4S2C_1:A 4S2C_1:A
4S2C_1:A 4S2B_1:A
4S2E_1:A 4S2E_1:A
4S2E_1:A 5XB5_1:A
4S2E_1:A 5XBH_1:A
The file is built so that the second column lists the sequences that are similar to the one in the first column. For example, 4S2P_1:A is similar to itself, 6Q5B_1:A, 6PXX_1:A, 6HB8_1:A and so on, while 4S2R_1:A is similar only to itself.
I want to parse the file to look like this:
4S2P_1:A 6PXX_1:A 6HB8_1:A 6HOO_1:A 6I5D_1:A
4S2E_1:A 5XB5_1:A 5XBH_1:A
4S2C_1:A 4S2B_1:A
4S2R_1:A
So I want each output line to contain an ID from the first column followed by the IDs linked to it, separated by spaces, with the resulting clusters listed in decreasing order of size.
I would like to use awk to do this.
I tried using this:
awk -F '\t' '{print $1*" "$2}'
But it gives me this output:
04S2P_1:A
05DTT_1:A
07ASS_1:A
07AUX_1:A
05HAQ_1:A
05HAP_1:A
05HAR_1:A
It adds a 0 at the beginning and doesn't keep the similar sequences on the same line.
CodePudding user response:
Typically a hash is used to make a list unique.
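#! /bin/bash
# Needs bash 4 or later for associative arrays; reads the pairs from stdin.
declare -A hash
while read -r c1 c2; do
    # Append each second-column value to the entry for its first-column key.
    hash[$c1]+=$'\t'"$c2"
done
for key in "${!hash[@]}"; do
    printf '%s%s\n' "$key" "${hash[$key]}"
done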
The disadvantage is that you lose the original order, because the keys of an associative array come back in no particular order. But it seems to me that you do not care about the original order. If you want to sort the output by the length of each line, you can take one of the answers to that question.
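As a rough sketch of that sorting step (my own illustration, not taken from any linked answer), you can prefix each line with its number of fields, sort on that prefix, and strip it again. The function name sort_by_size is made up for this example, and it assumes the IDs themselves contain no whitespace.

```shell
#!/bin/bash
# sort_by_size is a hypothetical helper name, not part of any tool.
# It reads one cluster per line on stdin and prints the lines with the
# largest number of whitespace-separated fields first.
sort_by_size() {
    # Prefix every line with its field count and a tab as a sort key.
    awk '{ printf "%d\t%s\n", NF, $0 }' |
    # Sort numerically on the key only, largest count first.
    sort -t "$(printf '\t')" -k1,1nr |
    # Drop the key again; everything after the first tab is kept.
    cut -f2-
}

# Small demonstration with made-up input.
printf '%s\n' 'a' 'b c d' 'e f' | sort_by_size
```

Piping the output of the script above through such a function would put the largest clusters first. Lines with the same number of fields come out in whatever order sort chooses for ties.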
CodePudding user response:
Here is a simple Awk script to lift values with the same key to the same line.
awk '$1 != prev { if(prev) printf "\n";
prev=$1; printf "%s", $2; next }
{ printf " %s", $2 }
END { if (prev) printf "\n" }' file
To sort by the length of each record, you will need to keep things in memory while reading. The above is attractive for its simplicity and robustness (should work for files of any size) but we can make it a little bit more involved to print a sort key in front of each line, at the cost of needing to keep each complete record in memory until we know its length.
awk 'function pr () { printf "%i\t", n; printf "%s", a[1];
      for(i=2; i<=n; i++) printf " %s", a[i];
      printf "\n"; delete a; n=0 }
  $1 != prev { if (prev) pr(); prev=$1; a[1]=$2; n=1; next }
  { a[++n] = $2 }
  END { if (n) pr() }' file |
sort -t $'\t' -k1rn |
cut -f2-
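Both scripts above compare each key with the previous one, so they rely on all lines for a key being adjacent in the input. If that is not guaranteed, a variation along these lines might work. It is only a sketch under a few assumptions: every cluster fits in memory, a line whose two columns are equal is a self-match that should not be repeated, and the name cluster_pairs is invented for the example.

```shell
#!/bin/bash
# cluster_pairs is a hypothetical wrapper name used only for this sketch.
# It reads two-column pairs on stdin and prints one cluster per line,
# largest cluster first.
cluster_pairs() {
    awk '
        NF < 2        { next }                               # skip blank or short lines
        !($1 in line) { line[$1] = $1; count[$1] = 1 }       # new key: start its line
        $2 != $1      { line[$1] = line[$1] " " $2; count[$1]++ }  # add a partner
        END           { for (k in line) printf "%d\t%s\n", count[k], line[k] }
    ' |
    # Sort by member count (descending), then by the line text to break ties.
    sort -t "$(printf '\t')" -k1,1nr -k2 |
    # Remove the count that was only there as a sort key.
    cut -f2-
}

# Demonstration on a few of the pairs from the question.
printf '%s\t%s\n' \
    4S2R_1:A 4S2R_1:A \
    4S2C_1:A 4S2C_1:A \
    4S2C_1:A 4S2B_1:A |
cluster_pairs
```

Unlike the streaming version, this keeps every cluster in memory until the end of the input, which is the price for not depending on the order of the lines.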