Merge unsorted lines from two files based on similar part


I am wondering whether it is possible to merge information from two files based on a shared part of each line. file1 contains sequence IDs (from a BLAST run), and file2 contains the taxonomic names corresponding to the first two numbers in each sequence name.

file 1:

>301-89_IDNAGNDJ_171582
>301-88_ALPEKDJF_119660
>301-88_ALPEKDJF_112039
...

file2:

301-89--sample1
301-88--sample2
...

output:

>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2

The files are unsorted, and file1 contains many lines whose first two numbers match the first two numbers of a single line in file2. I am looking for tips on how to do this: is it possible, and which command or language should I use?

CodePudding user response:

Using awk

$ awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
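
A commented, spelled-out version of the same idea (a sketch; note it keys the lookup array on the second number alone, e.g. 89, so it assumes that number is enough to identify the sample):

    # run as: awk -f merge.awk file2 file1   (merge.awk is just a suggested name)
    BEGIN { FS = "[_-]" }          # split fields on "_" and "-" in both files
    NR == FNR {                    # while reading the first file given (file2) ...
        a[$2] = $4                 # ... remember e.g. a["89"] = "sample1"
        next
    }
    { print $0 "--" a[$2] }        # reading file1: append "--" plus the matching sample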

CodePudding user response:

I am wondering whether it is possible to merge information from two files based on a shared part of each line

Yes ...

The files are unsorted

... but only if they're sorted.

It's easier if we transform them so the delimiters are consistent, and then format it back together later:

  1. sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 produces

    301-88 ALPEKDJF_112039
    301-88 ALPEKDJF_119660
    301-89 IDNAGNDJ_171582
    ...
    

    which we can just pipe through sort -k1

  2. sed 's/--/ /' file2 produces

    301-89 sample1
    301-88 sample2
    ...
    

    which we can sort the same way

  3. join sorted1 sorted2 (with the sorted results of the previous steps saved as sorted1 and sorted2; the combined commands are sketched after this list) produces

    301-88 ALPEKDJF_112039 sample2
    301-88 ALPEKDJF_119660 sample2
    301-89 IDNAGNDJ_171582 sample1
    ...
    
  4. and finally we can format those 3 fields (restoring the leading >) as you originally wanted, by piping through

    sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'

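Putting steps 1-4 together with explicit intermediate files (sorted1 and sorted2 are just scratch names here, assuming the inputs are called file1 and file2) is a straightforward sketch:

    sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 | sort -k1 > sorted1
    sed 's/--/ /' file2 | sort -k1 > sorted2
    join sorted1 sorted2 | sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
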
If it's reasonable to sort them on the fly, we can just do that using process substitution:

$ join \
     <( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 | sort -k1 ) \
     <( sed 's/--/ /' file2 | sort -k1 ) \
      | sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'

>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
>301-89_IDNAGNDJ_171582--sample1
...

If it's not reasonable to sort the files - on the fly or otherwise - you're going to end up building a hash in memory, like the awk answer is doing. Give them both a try and see which is faster.
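
To compare the two approaches, a rough timing harness might look like this (a sketch, assuming bash for the process substitution and inputs named file1 and file2 that are large enough for the difference to show up):

    # hash-in-memory approach (the awk answer above)
    time awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1 > /dev/null

    # sort + join approach
    time join \
         <( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 | sort -k1 ) \
         <( sed 's/--/ /' file2 | sort -k1 ) \
          | sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/' > /dev/null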

CodePudding user response:

(mawk/nawk/gawk -e/-ce/-Pe) '

   FNR == !_ {
      _ = !  ( ___=match(FS=FNR==NR ? "[-][-]" : "[>_]", "[>-]"))
     $_ = $_ 
 } FNR == NR { __[$!_]="--"$NF; next } sub("$", __[$___])' file2.txt file1.txt 

———————————————————————————

        >301-89_IDNAGNDJ_171582--sample1
        >301-88_ALPEKDJF_119660--sample2
        >301-88_ALPEKDJF_112039--sample2
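
For anyone puzzled by the condensed syntax above, here is an uncondensed sketch of (as far as I can tell) the same logic, with descriptive variable names:

    awk '
        FNR == 1 {                    # first line of each input file:
            # file2 is split on "--", file1 on ">" or "_"
            FS = (FNR == NR) ? "[-][-]" : "[>_]"
            $0 = $0                   # re-split the current record with the new FS
        }
        FNR == NR {                   # still reading file2.txt
            sample[$1] = "--" $NF     # e.g. sample["301-89"] = "--sample1"
            next
        }
        sub(/$/, sample[$2])          # file1.txt: append the lookup; sub() returning 1 prints the line
    ' file2.txt file1.txt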