Home > Software engineering >  Creating a new file based on three .tsv files in unix
Creating a new file based on three .tsv files in unix

Time:07-21

I have three very large files and I just want to grab the matching ID2s that come up for the first 20 ID2s for file 1 (in the example, I listed only 5). I want it to search file 2 and file 3 in its entirety and pull out the lines that match to the 20 ID2s from file1. I also want the ID1- even if it's blank (as well as other columns that are on the same matching ID2 line).

I know how to do this in python, but since the files are so large, my computer can't handle it. I have been trying to do it in the terminal (unix) with no luck. I tried the following command but 1)it doesn't deal with three files and 2) it is also throwing me an error (maybe I'm grabbing the wrong columns)?

awk ‘NR==FNR{a[$1][$0];next} $0 in a {for (i in a[$0]) print i}’ file1.tsv file2.tsv

Error:

zsh: parse error near `}'

Another question: How do I skip the first five rows in unix when doing the command (because these files have some notes in the beginning that are read in as rows)?


file1 
     id 1             id2          name
    A00000004   B00000004          emily    
                B00000005          joe      
    A00000006   B00000006          jack     
                B00000007          john
    A00000008   B00000008          sally    

file2
    id1             id2     source    source_code
    A00000001   B00000001   source1    321
    A00000003   B00000003   source2    165
    A00000004   B00000004   source3    481
    A00000005   B00000005   source2    891

file 3
      id2         code
    B00000006    yes    
    B00000007    no     
    B00000005    yes        
    B00000004    yes
    B00000012    yes    
Desired: 
    id1             id2     name      source    source_code    code
    A00000004   B00000004    emily     source3    481          yes
    A00000005   B00000005    joe       source2    891          yes

CodePudding user response:

First, to start on the Nth line of a file:

tail -n N infile

To match the id2 columns in the different files, I'd recommend the join command. However, there are drawbacks, including the requirement to know which columns contain id2 and the need to sort on those columns. This method also won't recognize more than one identical column (e.g. the id1 column will be repeated for each file in which it appears). It's also unclear to me from the question if there are headers you want to keep. This method will lose the headers.

# Sort files on the id2 column
# First, drop the header line.
# Then sort on the common column (in this case the 2nd field).
# "-t $'\t'" means to use tabs as the field separator. 
tail -n  2 file1 | sort -t $'\t' -k2,2 > file1.sort
tail -n  2 file2 | sort -t $'\t' -k2,2 > file2.sort
tail -n  2 file3 | sort -t $'\t' -k1,1 > file3.sort

First join file1.sort to file2.sort:

# The "-j 2" flag means that both files use the second column as the common field.
# One could also use "-1 2 -2 2" to specify the field for the first and second files. 
join -t $'\t' -j 2 file1.sort file2.sort > file1-2

Then join that file to file3.sort. Note that the output of join places the joined column first, by default.

join -t $'\t' -j 1 file1-2 file3.sort > files1-3

You can save a step, and an intermediary file, by combining the last two commands:

join -t $'\t' -j 2 file1.sort file2.sort | join -t $'\t' -j 1 - file3.sort > files1-3
>cat files1-3
B00000004   A00000004   emily   A00000004   source3 481 yes
B00000005       joe A00000005   source2 891 yes

If you know that the columns are arranged the way they are in the example, you can also use join's output formatting to clean things up a bit:

join -t $'\t' -j 2 file1.sort file2.sort | join -t $'\t' -j 1 -o 1.4,1.1,1.3,1.5,1.6,2.2 - file3.sort > files1-3_format
>cat files1-3_format
A00000004   B00000004   emily   source3 481 yes
A00000005   B00000005   joe source2 891 yes
  • Related