I am still quite a beginner and all threads I found that sounded like my problme were SQL related or the answer was join or merge, which does not work because the identifier is present multiple times in file2. I have two tab delimited files. One contains two columns, one with a unique numerical identifier, the other with a classification. There are thousands of them. The second one contains my data with one field containing an identifier. There are hundreds of lines, so many identifiers are not present while others are present multiple times. I need to connect my data stored in file2 with the classification stored in file1, based in the identifier in both files.
file 1:
12345 kitchen; furniture; table
12346 kitchen; furniture; chair
12347 living room; furniture; sofa
12348 living room; furniture; table
12349 bed room; furniture; bed
file 2:
stuff1 mo_restuff somenumbers anotherfield 12348
stuff2 morestuff othernumbers anotherfield 12346
stuff3 more_stuff somenumbers anotherfield 12347
stuff4 morestuff somenumbers yetanotherfield 12347
stuff5 morest.uff alsonumbers anotherfield 12345
The result should look like:
stuff1 mo_restuff somenumbers anotherfield living room; furniture; table
stuff2 morestuff othernumbers anotherfield kitchen; furniture; chair
stuff3 more_stuff somenumbers anotherfield living room; furniture; sofa
stuff4 morestuff somenumbers yetanotherfield living room; furniture; sofa
stuff5 morest.uff alsonumbers anotherfield kitchen; furniture; table
I tried (amongst many other things)
awk -F "\t" 'BEGIN { OFS=FS } NR==FNR { a[$1]=$0 ; next } ($5) in a { print a,$0 } ' file1 file2 > out
but this just printed file2.
I am working on Unix, best would be a solution in bash but python would also work.
Thanks also for all the answers to other questions that helped me before!
CodePudding user response:
Your script is close, but the action { print a,$0 }
is strange.
Would you please try the following:
awk '
BEGIN {FS = OFS = "\t"}
NR==FNR {a[$1] = $2; next}
{$5 = a[$5]}
1' file1 file2 > out
Output:
stuff1 mo_restuff somenumbers anotherfield living room; furniture; table
stuff2 morestuff othernumbers anotherfield kitchen; furniture; chair
stuff3 more_stuff somenumbers anotherfield living room; furniture; sofa
stuff4 morestuff somenumbers yetanotherfield living room; furniture; sofa
stuff5 morest.uff alsonumbers anotherfield kitchen; furniture; table
- In the first block for the condition
FR==FNR
, storing$2
will be good enough, not$0
. - In the next block for
file2
, just modify the 5th field by replacing with the value ofa
indexed by$5
. - The final
1
tells awk to print$0
, of which the 5th field is modified as above.