Linux: assigning character ids from file1 to two columns of file2, f2_col1 and f2

I have two files (both with the same N=millions of rows)

f1.txt :

1       J100079
2       J100180
3       J100228
4       J100291
5       J100333
6       J100537
7       J100549
8       J100757
9       J100953
10      J101030

and f2.txt:

1       1       117656  0.494925
2       1       117656  0.0021814
2       2       117656  0.496289
3       1       117656  -0.00205095
3       2       117656  0.0024429
3       3       117656  0.495278
4       1       117656  -0.000898346
4       2       117656  -0.00520983
4       3       117656  -0.00694337
4       4       117656  0.495535

I wish to create a f_final.txt file that has assigned character ids from f1.txt to columns 1 and 2 of f2.txt, with no changes in the order of rows of f2.txt.

I started the process by creating an f3.txt as a join of f1.txt and f2.txt:

join f1.txt f2.txt > f3.txt

cat f3.txt

1 J100079 1 117656 0.494925
2 J100180 1 117656 0.0021814
2 J100180 2 117656 0.496289
3 J100228 1 117656 -0.00205095
3 J100228 2 117656 0.0024429
3 J100228 3 117656 0.495278
4 J100291 1 117656 -0.000898346
4 J100291 2 117656 -0.00520983
4 J100291 3 117656 -0.00694337
4 J100291 4 117656 0.495535

Separated from f3.txt two new files: f4.txt and f5.txt using cut (be aware that now the delimiter has become ' ' from join):

cut -d$' ' -f 1,2 f3.txt > f4.txt
cut -d$' ' -f 3,5 f3.txt > f5.txt

cat f4.txt

1 J100079
2 J100180
2 J100180
3 J100228
3 J100228
3 J100228
4 J100291
4 J100291
4 J100291
4 J100291

cat f5.txt

1 0.494925
1 0.0021814
2 0.496289
1 -0.00205095
2 0.0024429
3 0.495278
1 -0.000898346
2 -0.00520983
3 -0.00694337
4 0.495535

f4.txt is good (no more changes to it). For f5.txt I want to join it with f1.txt so I can assign to f5.txt numerical ids, the f1.txt character ids, but I do not want to change the order of the rows, so no sorting on f5.txt
```
join f1.txt f5.txt > f6.txt

join: f5.txt:7: is not sorted: 1 -0.000898346
join: f1.txt:10: is not sorted: 10      J101030
```
step3 has an error. The last step would have been to column bind f4.txt and f6.txt with no changes in the order of rows.
```
paste -d" " f4.txt f6.txt > f_final.txt
```

The final output could have been like this:

1   J100079 1   J100079 0.494925
2   J100180 1   J100079 0.0021814
2   J100180 2   J100180 0.496289
3   J100228 1   J100079 -0.00205095
3   J100228 2   J100180 0.0024429
3   J100228 3   J100228 0.495278
4   J100291 1   J100079 -0.000898346
4   J100291 2   J100180 -0.00520983
4   J100291 3   J100228 -0.00694337
4   J100291 4   J100291 0.495535

Any suggestions greatly appreciated

CodePudding user response：

This matches your example exactly:

join f1.txt f2.txt |
sort -k '3,3' |
join -o '2.1,2.2,2.3,1.2,2.5' -1 1 -2 3 f1.txt - |
sort -k 1,1 > final.txt

If you need the last sort or not is up to you.

CodePudding user response：

Here is a solution my way with join and sort:

join f1.txt f2.txt > f3.txt
cat f3.txt
join -1 1 -2 3 -o'1.1,1.2,2.1,2.2,2.5' <(sort -k1 f1.txt) <(sort -k3 f3.txt) > f7.txt
cat f7.txt
sort -k1 -k3 < f7.txt > f8.txt
cat f8.txt

If you have a shorter way that may save computing time to millions of rows joining post a solution. For sure I will not use cat, because I have tested it works with the toy example.