AWK for loop with two input files

Time:07-06

I have the following files:

A-111.txt
A-311.txt

B-111.txt
B-311.txt

C-111.txt
C-312.txt

D-112.txt
D-311.txt

I want to merge lines of files with the same basename (same letter before the dash) if there is a match in column 4. I have many files, so I want to do it in a loop.

So far I have this:

for f1 in *-1**.txt; do f2="${f1/-1/-3}"; awk -F"\t" 'BEGIN { OFS=FS } NR==FNR { a[$4,$4]=$0 ; next } ($4,$4) in a { print a[$4,$4],$0 }' "$f1" "$f2" > $f1_merged.txt; done

It works for files A and B as intended, but not for files C and D. Can someone help me improve the code, please?


EDIT - here's the above code formatted legibly:

for f1 in *-1**.txt; do
    f2="${f1/-1/-3}"
    awk -F"\t" '
        BEGIN {
            OFS = FS
        }
        NR == FNR {
            a[$4, $4] = $0
            next
        }
        ($4, $4) in a {
            print a[$4, $4], $0
        }
    ' "$f1" "$f2" > $f1_merged.txt
done

EDIT - after Ed Morton kindly formatted my code, the error is:

awk: cmd. line:7: fatal: cannot open file 'C-311.txt' for reading (No such file or directory)

awk: cmd. line:7: fatal: cannot open file 'D-312.txt' for reading (No such file or directory)
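
The file names in these errors come from the substitution "${f1/-1/-3}" itself: it replaces only the first "-1" with "-3" and keeps the remaining digits, so it assumes the partner file ends in the same digits. A small check with the names from the listing above (partner_name is a hypothetical helper, used here only for illustration):

```shell
#!/bin/bash
# Show which partner name the substitution derives from each first file.
# partner_name is a hypothetical helper, not part of the original loop.
partner_name() {
    local f1=$1
    printf '%s\n' "${f1/-1/-3}"    # replace the first "-1" with "-3"
}

for f1 in A-111.txt B-111.txt C-111.txt D-112.txt; do
    printf '%s -> %s\n' "$f1" "$(partner_name "$f1")"
done
```

For C-111.txt this yields C-311.txt and for D-112.txt it yields D-312.txt, neither of which exists in the listing (the real partners are C-312.txt and D-311.txt).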

CodePudding user response:

Would you please try the following:

#!/bin/bash

prefix="ref_"                                   # prefix to declare array variable names
declare -A bases                                # array to count files for the basename
for f in *-[0-9]*.txt; do                       # loop over the target files
    base=${f%%-*}                               # extract the basename
    declare -n ref="$prefix$base"               # indirect reference to an array named "$base"
    ref+=("$f")                                 # append the filename to the list for the basename
    (( bases[$base]++ ))                        # count the number of files for the basename
done

for base in "${!bases[@]}"; do                  # loop over the basenames
    if (( ${bases[$base]} == 2 )); then         # check that the basename has exactly two files
        declare -n ref="$prefix$base"           # indirect reference
        IFS=$'\t' read -ra a < "${ref[0]}"      # read the first line of the 1st file into array a, split on tabs
        IFS=$'\t' read -ra b < "${ref[1]}"      # read the first line of the 2nd file into array b, split on tabs
        if [[ ${a[3]} = "${b[3]}" ]]; then      # compare the 4th columns as literal strings
            paste "${ref[@]}" > "${base}_merged.txt"
        fi
    fi
done
  • First extract the basenames such as "A", "B", and so on, then create a list of the associated filenames. For instance, the array "ref_A" will hold ('A-111.txt' 'A-311.txt'). At the same time, the array bases counts the files for each basename.
  • Then loop over the basenames, make sure the number of associated files is two, and compare the 4th columns of the first line of each file. If they match, concatenate the files to generate a new file.
  • paste "${ref[@]}" concatenates the lines of two files side by side delimited by a tab.
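
If a line-by-line join on column 4 is wanted, as in the question, the awk part of the original loop can be kept and only the partner lookup changed. The following is a minimal sketch, not a tested replacement. It assumes that each basename has exactly one "-1" file and one "-3" file, that column 4 values are unique within the first file, and that the first file is not empty. The function name merge_pairs is hypothetical.

```shell
#!/bin/bash
# Sketch: join each X-1*.txt with its X-3*.txt partner on column 4.
# Assumes exactly one "-1" file and one "-3" file per basename.
# merge_pairs is a hypothetical name chosen for this example.
merge_pairs() {
    local f1 base partners
    for f1 in *-1*.txt; do
        [[ -e $f1 ]] || continue              # the glob matched nothing
        base=${f1%%-*}                        # "C-111.txt" -> "C"
        partners=( "$base"-3*.txt )           # find the partner by glob, not by substitution
        [[ ${#partners[@]} -eq 1 && -e ${partners[0]} ]] || continue
        awk -F'\t' '
            BEGIN { OFS = FS }
            NR == FNR { a[$4] = $0; next }    # 1st file: index each line by column 4
            $4 in a { print a[$4], $0 }       # 2nd file: print the matching pair
        ' "$f1" "${partners[0]}" > "${base}_merged.txt"
    done
}
```

Calling merge_pairs in the directory that holds the files writes one "<basename>_merged.txt" per pair. The output name follows the script above; in the original loop, $f1_merged.txt expands the unset variable f1_merged rather than f1, so every result would have been written to a file named ".txt".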