Linux: assigning character ids from file1 to two columns of file2, f2_col1 and f2_col2 without chang

Time:11-10

I have two files, both with the same number of rows (N in the millions).

f1.txt :

1       J100079
2       J100180
3       J100228
4       J100291
5       J100333
6       J100537
7       J100549
8       J100757
9       J100953
10      J101030

and f2.txt:

1       1       117656  0.494925
2       1       117656  0.0021814
2       2       117656  0.496289
3       1       117656  -0.00205095
3       2       117656  0.0024429
3       3       117656  0.495278
4       1       117656  -0.000898346
4       2       117656  -0.00520983
4       3       117656  -0.00694337
4       4       117656  0.495535

I want to create an f_final.txt file in which the numerical ids in columns 1 and 2 of f2.txt are paired with the character ids from f1.txt, without changing the row order of f2.txt.

  1. I started the process by creating an f3.txt as a join of f1.txt and f2.txt:

    join f1.txt f2.txt > f3.txt
    
    cat f3.txt
    
    1 J100079 1 117656 0.494925
    2 J100180 1 117656 0.0021814
    2 J100180 2 117656 0.496289
    3 J100228 1 117656 -0.00205095
    3 J100228 2 117656 0.0024429
    3 J100228 3 117656 0.495278
    4 J100291 1 117656 -0.000898346
    4 J100291 2 117656 -0.00520983
    4 J100291 3 117656 -0.00694337
    4 J100291 4 117656 0.495535
    
  2. I then split f3.txt into two new files, f4.txt and f5.txt, using cut (note that join has changed the delimiter to a single space):

    cut -d$' ' -f 1,2 f3.txt > f4.txt
    cut -d$' ' -f 3,5 f3.txt > f5.txt
    
    cat f4.txt
    
    1 J100079
    2 J100180
    2 J100180
    3 J100228
    3 J100228
    3 J100228
    4 J100291
    4 J100291
    4 J100291
    4 J100291
    
    cat f5.txt
    
    1 0.494925
    1 0.0021814
    2 0.496289
    1 -0.00205095
    2 0.0024429
    3 0.495278
    1 -0.000898346
    2 -0.00520983
    3 -0.00694337
    4 0.495535
    
  3. f4.txt is fine as it is. For f5.txt, I want to join it with f1.txt so that the numerical ids in f5.txt get the character ids from f1.txt. I do not want to change the row order, so sorting f5.txt is not an option:

    join f1.txt f5.txt > f6.txt
    
    join: f5.txt:7: is not sorted: 1 -0.000898346
    join: f1.txt:10: is not sorted: 10      J101030
    
  4. Step 3 fails with the errors above. The last step would have been to column-bind f4.txt and f6.txt without changing the row order:

    paste -d" " f4.txt f6.txt > f_final.txt
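
A note on the two errors in step 3: they are consistent with join expecting both inputs to be sorted on the join field in lexical (collation) order rather than numeric order. f5.txt restarts at 1 on line 7, and in f1.txt the id 10 comes after 9 although it sorts before 2 lexically. A small sketch with the ids 1 to 10 (the variable names are only for illustration):

```shell
# The ids 1..10 are in numeric order, but not in the order join checks for.
ids=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# sort -c exits non-zero when its input is not already sorted.
if printf '%s\n' "$ids" | LC_ALL=C sort -c 2>/dev/null; then
    order_check="sorted"
else
    order_check="not sorted"
fi
echo "$order_check"

# Lexical order puts 10 directly after 1.
lexical=$(printf '%s\n' "$ids" | LC_ALL=C sort | tr '\n' ' ')
echo "$lexical"
```

So f1.txt would need a lexical sort on column 1 before any join, and the numeric row order would have to be restored afterwards.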
    

The desired final output would look like this:

1   J100079 1   J100079 0.494925
2   J100180 1   J100079 0.0021814
2   J100180 2   J100180 0.496289
3   J100228 1   J100079 -0.00205095
3   J100228 2   J100180 0.0024429
3   J100228 3   J100228 0.495278
4   J100291 1   J100079 -0.000898346
4   J100291 2   J100180 -0.00520983
4   J100291 3   J100228 -0.00694337
4   J100291 4   J100291 0.495535

Any suggestions are greatly appreciated.

CodePudding user response:

This matches your example exactly:

join f1.txt f2.txt |
sort -k '3,3' |
join -o '2.1,2.2,2.3,1.2,2.5' -1 1 -2 3 f1.txt - |
sort -k 1,1 > final.txt

Whether you need the final sort is up to you.
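
If the rows of f2.txt must come back in exactly their original order, one option is to tag each row with its line number before sorting and sort on that tag at the end. The following is only a sketch under my own assumptions: the scratch directory, the file f1.sorted.txt, and the deliberately shuffled toy rows are illustrative and not part of the question.

```shell
export LC_ALL=C                      # same collation for sort and join
workdir=$(mktemp -d)                 # scratch directory (illustrative)
cat > "$workdir/f1.txt" <<'EOF'
1 J100079
2 J100180
3 J100228
EOF
# Toy f2 rows, deliberately not ordered by column 1.
cat > "$workdir/f2.txt" <<'EOF'
3 1 117656 -0.00205095
1 1 117656 0.494925
3 2 117656 0.0024429
EOF

# join needs f1 in lexical order on the id column.
sort -k1,1 "$workdir/f1.txt" > "$workdir/f1.sorted.txt"

awk '{ print NR, $1, $2, $4 }' "$workdir/f2.txt" |   # n id1 id2 value
sort -k2,2 |
join -1 1 -2 2 -o '2.1,2.2,1.2,2.3,2.4' "$workdir/f1.sorted.txt" - |       # n id1 name1 id2 value
sort -k4,4 |
join -1 1 -2 4 -o '2.1,2.2,2.3,2.4,1.2,2.5' "$workdir/f1.sorted.txt" - |   # n id1 name1 id2 name2 value
sort -k1,1n |                                        # back to the original row order
cut -d' ' -f2- > "$workdir/f_final.txt"              # drop the line-number tag

cat "$workdir/f_final.txt"
```

The numeric sort on the tag restores the input order regardless of how the id columns themselves compare.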

CodePudding user response:

Here is my own solution using join and sort:

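join f1.txt f2.txt > f3.txt
cat f3.txt
# -o puts the row id and its character id first, then the column id and its character id
join -1 1 -2 3 -o '2.1,2.2,1.1,1.2,2.5' <(sort -k1,1 f1.txt) <(sort -k3,3 f3.txt) > f7.txt
cat f7.txt
# numeric keys, so that ids of 10 and above sort after 9
sort -k1,1n -k3,3n < f7.txt > f8.txt
cat f8.txt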

If you have a shorter way that saves computing time when joining millions of rows, please post it. I will of course drop the cat calls in the real run; I only used them to check that this works on the toy example.
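
One possibly shorter route, offered as a sketch rather than something tested at the scale of millions of rows: a single awk pass that loads f1.txt into a lookup table and then rewrites f2.txt line by line. It never sorts, so the row order of f2.txt is kept, and it sidesteps the sorted-input requirement of join. The scratch directory and the array name `name` are my own choices, and the toy rows are a subset of the example data.

```shell
workdir=$(mktemp -d)   # scratch directory for the toy data (illustrative)
cat > "$workdir/f1.txt" <<'EOF'
1 J100079
2 J100180
3 J100228
4 J100291
EOF
cat > "$workdir/f2.txt" <<'EOF'
1 1 117656 0.494925
2 1 117656 0.0021814
2 2 117656 0.496289
3 1 117656 -0.00205095
EOF

# First file (NR == FNR): build the lookup table numeric id -> character id.
# Second file: print each row in its original order with both ids translated.
awk 'NR == FNR { name[$1] = $2; next }
     { print $1, name[$1], $2, name[$2], $4 }' \
    "$workdir/f1.txt" "$workdir/f2.txt" > "$workdir/f_final.txt"

cat "$workdir/f_final.txt"
```

The table for f1.txt is held in memory, so with millions of ids it is worth checking memory use first.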
