Home > database >  How can I use a file to fill in missing information from another file using grep or awk?
How can I use a file to fill in missing information from another file using grep or awk?

Time:06-21

I have two .tsv files, the first significantly longer than the other. The second files serves as an index of values associated with the marker present in the first. How can I fill the information from the second file into the first? I figured I could use grep or awk, but I am stumped as to how seeing as I am a beginner. Any help is appreciated!

file1

item_1
item_1
item_2
item_3
item_4
item_4

file2

item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315

wanted outcome

item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315
item_4     9    1024249    1631315

CodePudding user response:

The join utility performs an equality join on the specified files and writes the result to the standard output.

join file1.tsv file2.tsv

CodePudding user response:

awk-based solution with

  • no pre-sorting of files required
  • file-2 de-duped on line-wide basis

|

{m,n,g}awk 'NF = NR==FNR ? _*  __[$ _] : __[substr($ _, !_, match($ _, 
                 ".[ \t]"))]* (__[$ _]  <NF)-(OFS="\n" $-_)' FS='^$' file1 file2

item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315
item_4     9    1024249    1631315

it behaves somewhat similarly to SQL SELECT statement INNER JOINs, but not necessarily identical ::

-- it returns the cartesian product of non-de-duped file1 by deduped-file2 ("DD2"), joined on $1 of both, even if $1 isn't unique within DD2

e.g. what "uniq -c" looked like after I randomized it, 
     plus altering value of last column in some of them :

           2 item_1 12 1002839 1231231
           2 item_1 12 1002839 1231232
           1 item_2 13 1006323 1621312
           1 item_2 13 1006323 1621313
           1 item_3 11 1052636 1931232
           1 item_3 11 1052636 1931233
           2 item_4 9 1024249 1631314
           2 item_4 9 1024249 1631315

CodePudding user response:

# print empty line if item does  not exist in file2.tsv
$ awk 'NR==FNR{a[$1]=$0; next}{$0=a[$1]}1' file2.tsv file1.tsv
item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233

item_4     9    1024249    1631315

item_4     9    1024249    1631315

# print only item if item does not exist in file2.tsv
$ awk 'NR==FNR{a[$1]=$0; next}{a[$1] ? $0=a[$1]: $0}1' file2.tsv file1.tsv
item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_7
item_4     9    1024249    1631315
item_5
item_4     9    1024249    1631315

# skip line if item does not exist in file2.tsv
# $ awk 'NR==FNR{a[$1]=$0; next}{if(a[$1]){$0=a[$1]}else{next}}1' file2.tsv file1.tsv
item_1    12    1002839    1231232
item_1    12    1002839    1231232
item_2    13    1006323    1621313
item_3    11    1052636    1931233
item_4     9    1024249    1631315
item_4     9    1024249    1631315 
  • Related