I have two .tsv files, the first significantly longer than the other. The second files serves as an index of values associated with the marker present in the first. How can I fill the information from the second file into the first? I figured I could use grep or awk, but I am stumped as to how seeing as I am a beginner. Any help is appreciated!
file1
item_1
item_1
item_2
item_3
item_4
item_4
file2
item_1 12 1002839 1231232
item_2 13 1006323 1621313
item_3 11 1052636 1931233
item_4 9 1024249 1631315
wanted outcome
item_1 12 1002839 1231232
item_1 12 1002839 1231232
item_2 13 1006323 1621313
item_3 11 1052636 1931233
item_4 9 1024249 1631315
item_4 9 1024249 1631315
CodePudding user response:
The join
utility performs an equality join on the specified files and writes the result to the standard output.
join file1.tsv file2.tsv
CodePudding user response:
awk
-based solution with
- no pre-sorting of files required
file-2
de-duped on line-wide basis
|
{m,n,g}awk 'NF = NR==FNR ? _* __[$ _] : __[substr($ _, !_, match($ _,
".[ \t]"))]* (__[$ _] <NF)-(OFS="\n" $-_)' FS='^$' file1 file2
item_1 12 1002839 1231232
item_1 12 1002839 1231232
item_2 13 1006323 1621313
item_3 11 1052636 1931233
item_4 9 1024249 1631315
item_4 9 1024249 1631315
it behaves somewhat similarly to SQL
SELECT
statement INNER JOIN
s, but not necessarily identical ::
-- it returns the cartesian product of non-de-duped file1
by deduped-file2
("DD2"
), joined on $1
of both, even if $1
isn't unique within DD2
e.g. what "uniq -c" looked like after I randomized it,
plus altering value of last column in some of them :
2 item_1 12 1002839 1231231
2 item_1 12 1002839 1231232
1 item_2 13 1006323 1621312
1 item_2 13 1006323 1621313
1 item_3 11 1052636 1931232
1 item_3 11 1052636 1931233
2 item_4 9 1024249 1631314
2 item_4 9 1024249 1631315
CodePudding user response:
# print empty line if item does not exist in file2.tsv
$ awk 'NR==FNR{a[$1]=$0; next}{$0=a[$1]}1' file2.tsv file1.tsv
item_1 12 1002839 1231232
item_1 12 1002839 1231232
item_2 13 1006323 1621313
item_3 11 1052636 1931233
item_4 9 1024249 1631315
item_4 9 1024249 1631315
# print only item if item does not exist in file2.tsv
$ awk 'NR==FNR{a[$1]=$0; next}{a[$1] ? $0=a[$1]: $0}1' file2.tsv file1.tsv
item_1 12 1002839 1231232
item_1 12 1002839 1231232
item_2 13 1006323 1621313
item_3 11 1052636 1931233
item_7
item_4 9 1024249 1631315
item_5
item_4 9 1024249 1631315
# skip line if item does not exist in file2.tsv
# $ awk 'NR==FNR{a[$1]=$0; next}{if(a[$1]){$0=a[$1]}else{next}}1' file2.tsv file1.tsv
item_1 12 1002839 1231232
item_1 12 1002839 1231232
item_2 13 1006323 1621313
item_3 11 1052636 1931233
item_4 9 1024249 1631315
item_4 9 1024249 1631315