Home > Mobile >  Complete file2 with data from file1
Complete file2 with data from file1

Time:12-07

I have two files with fields separated with tabs:

  1. File1 has 13 columns and 90 millions of lines (~5GB). The number of lines of file 1 is always smaller than the number of lines of file2.
1   1   27  0   2   0   0   1   0   0   0   1   false
1   2   33  0   3   0   0   0   0   0   0   1   false
1   5   84  3   0   0   0   0   0   0   0   2   false
1   6   41  0   1   0   0   0   0   0   0   1   false
1   7   8   4   0   0   0   0   0   0   0   1   false
  1. File2 has 2 columns and 100 millions of lines (1.3GB)
1   1
1   2
1   3
1   4
1   5

What I want to achieve:

When the pair columns $1/$2 of file2 is identical to the pair columns $1/$2 of file1, I would like to print $1 and $2 from file2 and $3 from file1 into an output file. In addition, if the pair $1/$2 of file2 does not have a match in file1, print $1/$2 in the output and the 3rd column is left empty. Thus, the output keeps the same structure (number of lines) than file2.

If relevant: The pairs $1/$2 are unique in both file1 and 2 and both files are sorted according $1 first and then $2.

Output file:

1   1   27
1   2   33
1   3   45
1   4
1   5   84

What I have done so far:

awk -F"\t" 'NR == FNR {a[$1 "\t" $2] = $3; next } { print $0 "\t" a[$1 "\t" $2] }' file1 file2 > output

The command runs for few minutes and unexpectedly stop without additional information. When I open the output file, the first 5 to 6x10E6 lines have been correctly processed (I can see the 3rd column that was correctly added) but the rest of the output file does not have a 3rd column. I am running this command on a 3.2 GHz Intel Core i5 with 32 GB 1600 MHz DDR3. Any ideas why the command stops? Thanks for your help.

CodePudding user response:

You are close.

I would do something like this:

awk 'BEGIN{FS=OFS="\t"}
    {key=$1 FS $2}
    NR==FNR{seen[key]=$3; next}
    key in seen {print key, seen[key]}
' file1 file2

Or, since file1 is bigger, reverse which file is held in memory:

awk 'BEGIN{FS=OFS="\t"}
    {key=$1 FS $2}
    NR==FNR{seen[key]; next}
    key in seen {print key, $3}
' file2 file1

You could also use join which will likely handle files much larger than memory. This is BSD join that can use multiple fields for the join:

join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2

join requires the files be sorted, as your example is. If not sorted, you could do:

join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 <(sort -n file1) <(sort -n file2)

Or, if your join can only use a single field, you can temporarily use ' ' as the field separator between field 2 and 3 and set join to use that as the delimiter:

join -1 1 -2 1 -t $' ' -o 1.1,2.2  <(sort -k1n -k2n file2) <(awk '{printf("%s\t%s %s\n",$1,$2,$3)}' file1 | sort -k1n -k2n) | sed 's/[ ]/\t/'

Either awk or join prints:

1   1   27
1   2   33
1   3   45
1   4   7
1   5   84

Your comment:

After additional investigations, the suggested solutions did not worked because my question was not properly asked (my mistake). The suggested solutions printed lines only when matches between pairs ($1/$2) were found between files 1 and 2. Thus, the resulting output file has always the number of lines of file1 (that is always smaller than file2). I want the output file to keep the same structure than file2, as said, the same number of lines (for further comparison). The question was further refined.

If your computer can handle the file sizes:

awk 'BEGIN{FS=OFS="\t"}
    {key=$1 FS $2}
    NR==FNR{seen[key]=$3; next}
    {if (key in seen)
        print key, seen[key]
    else
        print key
    }
' file1 file2

Otherwise you can filter file1 so only the matches are feed to the awk from file1 and then file2 dictates the final output structure:

awk 'BEGIN{FS=OFS="\t"}
    {key=$1 FS $2}
    NR==FNR{seen[key]=$3; next}
    {if (key in seen)
        print key, seen[key]
    else
        print key
    }
' <(join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2) file2

CodePudding user response:

Assumptions/understandings:

  • both input files have already been sorted by the 1st and 2nd columns
  • within a given file the combination of the 1st and 2nd columns represents a unique key (ie, there are no duplicate keys in a file)
  • all lines from file2 are to be written to stdout while an optional 3rd column will be derived from file1 (when the key exists in both files)

One awk idea that implements a 'merge join' operation (ie, read both files in parallel):

awk -v lookup="file1" '                               # assign awk variable "lookup" the name of the file where we will obtain the optional 3rd column from

function get_lookup() {                               # function to read a new line from the "lookup" file
    rc=getline line < lookup                          # read next line from "lookup" file into variable "line"; rc==1 if successful; rc==0 if reached end of file
    if (rc) split(line,f)                             # if successful (rc==1) then split "line" into array f[]
}

BEGIN  { FS=OFS="\t"                                  # define input/output field delimiters
         get_lookup()                                 # read first line from "lookup" file
       }
       { c3=""                                        # set optional 3rd column to blank

         while (rc) {                                 # while we have a valid line from the "lookup" file look for a matching line ...

                    if ($1< f[1]            ) {                             break }
               else if ($1==f[1] && $2< f[2]) {                             break }
               else if ($1==f[1] && $2==f[2]) { c3= OFS f[3]; get_lookup(); break }
               else if ($1==f[1] && $2> f[2]) {               get_lookup()        }
               else if ($1> f[1]            ) {               get_lookup()        }
         }

         print $0 c3                                 # print current line plus optional 3rd column
       }
' file2

NOTE: memory usage should not be an issue since we only have 2 lines (one from each file) stored in memory at any given time

This generates:

1       1       27
1       2       33
1       3
1       4
1       5       84
  • Related