I have two files (both with the same N=millions of rows)
f1.txt :
1 J100079
2 J100180
3 J100228
4 J100291
5 J100333
6 J100537
7 J100549
8 J100757
9 J100953
10 J101030
and f2.txt:
1 1 117656 0.494925
2 1 117656 0.0021814
2 2 117656 0.496289
3 1 117656 -0.00205095
3 2 117656 0.0024429
3 3 117656 0.495278
4 1 117656 -0.000898346
4 2 117656 -0.00520983
4 3 117656 -0.00694337
4 4 117656 0.495535
I wish to create a f_final.txt
file that has assigned character ids from f1.txt
to columns 1 and 2 of f2.txt
, with no changes in the order of rows of f2.txt
.
I started the process by creating an
f3.txt
as a join off1.txt
andf2.txt
:join f1.txt f2.txt > f3.txt cat f3.txt 1 J100079 1 117656 0.494925 2 J100180 1 117656 0.0021814 2 J100180 2 117656 0.496289 3 J100228 1 117656 -0.00205095 3 J100228 2 117656 0.0024429 3 J100228 3 117656 0.495278 4 J100291 1 117656 -0.000898346 4 J100291 2 117656 -0.00520983 4 J100291 3 117656 -0.00694337 4 J100291 4 117656 0.495535
Separated from
f3.txt
two new files:f4.txt
andf5.txt
using cut (be aware that now the delimiter has become ' ' from join):cut -d$' ' -f 1,2 f3.txt > f4.txt cut -d$' ' -f 3,5 f3.txt > f5.txt cat f4.txt 1 J100079 2 J100180 2 J100180 3 J100228 3 J100228 3 J100228 4 J100291 4 J100291 4 J100291 4 J100291 cat f5.txt 1 0.494925 1 0.0021814 2 0.496289 1 -0.00205095 2 0.0024429 3 0.495278 1 -0.000898346 2 -0.00520983 3 -0.00694337 4 0.495535
f4.txt
is good (no more changes to it). Forf5.txt
I want to join it withf1.txt
so I can assign tof5.txt
numerical ids, thef1.txt
character ids, but I do not want to change the order of the rows, so no sorting onf5.txt
join f1.txt f5.txt > f6.txt join: f5.txt:7: is not sorted: 1 -0.000898346 join: f1.txt:10: is not sorted: 10 J101030
step3 has an error. The last step would have been to column bind f4.txt and f6.txt with no changes in the order of rows.
paste -d" " f4.txt f6.txt > f_final.txt
The final output could have been like this:
1 J100079 1 J100079 0.494925
2 J100180 1 J100079 0.0021814
2 J100180 2 J100180 0.496289
3 J100228 1 J100079 -0.00205095
3 J100228 2 J100180 0.0024429
3 J100228 3 J100228 0.495278
4 J100291 1 J100079 -0.000898346
4 J100291 2 J100180 -0.00520983
4 J100291 3 J100228 -0.00694337
4 J100291 4 J100291 0.495535
Any suggestions greatly appreciated
CodePudding user response:
This matches your example exactly:
join f1.txt f2.txt |
sort -k '3,3' |
join -o '2.1,2.2,2.3,1.2,2.5' -1 1 -2 3 f1.txt - |
sort -k 1,1 > final.txt
If you need the last sort
or not is up to you.
CodePudding user response:
Here is a solution my way with join and sort:
join f1.txt f2.txt > f3.txt
cat f3.txt
join -1 1 -2 3 -o'1.1,1.2,2.1,2.2,2.5' <(sort -k1 f1.txt) <(sort -k3 f3.txt) > f7.txt
cat f7.txt
sort -k1 -k3 < f7.txt > f8.txt
cat f8.txt
If you have a shorter way that may save computing time to millions of rows joining post a solution. For sure I will not use cat, because I have tested it works with the toy example.