How can I use a file to fill in missing information from another file using grep or awk?-CodePudding

I have two .tsv files, the first significantly longer than the other. The second files serves as an index of values associated with the marker present in the first. How can I fill the information from the second file into the first? I figured I could use grep or awk, but I am stumped as to how seeing as I am a beginner. Any help is appreciated!

file1

item_1
item_1
item_2
item_3
item_4
item_4

file2

item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315

wanted outcome

item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315
item_4     9    1024249    1631315

CodePudding user response：

The join utility performs an equality join on the specified files and writes the result to the standard output.

join file1.tsv file2.tsv

CodePudding user response：

awk-based solution with

no pre-sorting of files required
file-2 de-duped on line-wide basis

{m,n,g}awk 'NF = NR==FNR ? _*  __[$ _] : __[substr($ _, !_, match($ _, 
                 ".[ \t]"))]* (__[$ _]  <NF)-(OFS="\n" $-_)' FS='^$' file1 file2

item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315
item_4     9    1024249    1631315

it behaves somewhat similarly to SQL SELECT statement INNER JOINs, but not necessarily identical ::

-- it returns the cartesian product of non-de-duped file1 by deduped-file2 ("DD2"), joined on $1 of both, even if $1 isn't unique within DD2

e.g. what "uniq -c" looked like after I randomized it, 
     plus altering value of last column in some of them :

           2 item_1 12 1002839 1231231
           2 item_1 12 1002839 1231232
           1 item_2 13 1006323 1621312
           1 item_2 13 1006323 1621313
           1 item_3 11 1052636 1931232
           1 item_3 11 1052636 1931233
           2 item_4 9 1024249 1631314
           2 item_4 9 1024249 1631315

CodePudding user response：

# print empty line if item does  not exist in file2.tsv
$ awk 'NR==FNR{a[$1]=$0; next}{$0=a[$1]}1' file2.tsv file1.tsv
item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233

item_4     9    1024249    1631315

item_4     9    1024249    1631315

# print only item if item does not exist in file2.tsv
$ awk 'NR==FNR{a[$1]=$0; next}{a[$1] ? $0=a[$1]: $0}1' file2.tsv file1.tsv
item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_7
item_4     9    1024249    1631315
item_5
item_4     9    1024249    1631315

# skip line if item does not exist in file2.tsv
# $ awk 'NR==FNR{a[$1]=$0; next}{if(a[$1]){$0=a[$1]}else{next}}1' file2.tsv file1.tsv
item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315
item_4     9    1024249    1631315