How to substitute data in one field with data from different file based on common identifier (in ano-CodePudding

I am still quite a beginner and all threads I found that sounded like my problme were SQL related or the answer was join or merge, which does not work because the identifier is present multiple times in file2. I have two tab delimited files. One contains two columns, one with a unique numerical identifier, the other with a classification. There are thousands of them. The second one contains my data with one field containing an identifier. There are hundreds of lines, so many identifiers are not present while others are present multiple times. I need to connect my data stored in file2 with the classification stored in file1, based in the identifier in both files.

file 1:

12345   kitchen; furniture; table
12346   kitchen; furniture; chair
12347   living room; furniture; sofa
12348   living room; furniture; table
12349   bed room; furniture; bed

file 2:

stuff1  mo_restuff  somenumbers anotherfield    12348
stuff2  morestuff   othernumbers    anotherfield    12346
stuff3  more_stuff  somenumbers anotherfield    12347
stuff4  morestuff   somenumbers yetanotherfield 12347
stuff5  morest.uff  alsonumbers anotherfield    12345

The result should look like:

stuff1  mo_restuff  somenumbers anotherfield    living room; furniture; table
stuff2  morestuff   othernumbers    anotherfield    kitchen; furniture; chair
stuff3  more_stuff  somenumbers anotherfield    living room; furniture; sofa
stuff4  morestuff   somenumbers yetanotherfield living room; furniture; sofa
stuff5  morest.uff  alsonumbers anotherfield    kitchen; furniture; table

I tried (amongst many other things)

awk -F "\t" 'BEGIN { OFS=FS } NR==FNR { a[$1]=$0 ; next } ($5) in a  { print a,$0 } ' file1 file2 > out

but this just printed file2.

I am working on Unix, best would be a solution in bash but python would also work.

Thanks also for all the answers to other questions that helped me before!

CodePudding user response：

Your script is close, but the action { print a,$0 } is strange. Would you please try the following:

awk '
    BEGIN {FS = OFS = "\t"}
    NR==FNR {a[$1] = $2; next}
    {$5 = a[$5]}
1' file1 file2 > out

Output:

stuff1  mo_restuff      somenumbers     anotherfield    living room; furniture; table
stuff2  morestuff       othernumbers    anotherfield    kitchen; furniture; chair
stuff3  more_stuff      somenumbers     anotherfield    living room; furniture; sofa
stuff4  morestuff       somenumbers     yetanotherfield living room; furniture; sofa
stuff5  morest.uff      alsonumbers     anotherfield    kitchen; furniture; table

In the first block for the condition FR==FNR, storing $2 will be good enough, not $0.
In the next block for file2, just modify the 5th field by replacing with the value of a indexed by $5.
The final 1 tells awk to print $0, of which the 5th field is modified as above.