Home > Software engineering >  Faster way to merge multiple files with unequal number of rows by column in bash
Faster way to merge multiple files with unequal number of rows by column in bash

Time:02-21

I have a multiple files that I want to merge by column using shell script, let's say file a.txt and file b.txt. File a.txt contains sorted unique value and the first column will be used as a reference value.
example :

# cat a.txt 
001|johan
002|mike
003|adam
# cat b.txt
001|chu
001|stewart
002|lewis
002|jordan
003|lambert
003|johnson
003|smith
003|long

The both files will be combined to produce an output like below.

# cat c.txt
001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long

I tried simple iteration using while do.

while read line
do
  ids=`echo $line | awk -F"|" '{print $1}'`
  fn=`grep $ids a.txt`
  echo $fn"|"$line | awk -F"|" '{print $1"|"$2" "$4}'
done < b.txt > c.txt

But it takes a lot of time if I have million number of rows.
In MySQL we can easily achieve it using JOIN clause. But we need to load/insert them first.
Faster approach is probably using paste command, but as far as I know, the total rows of both files must be equal. Well I can adjust a.txt first. But still comsumes a lot of time when the script running.
Maybe someone has a better approach.

CodePudding user response:

You could put all in one awk script:

awk -F'|' '{if (NR==FNR) a[$1]=$2; else print $1 "|" a[$1] " " $2}' a.txt b.txt 
001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long

CodePudding user response:

Assumptions:

  • both files are sorted by the 1st column
  • ignore rows that do not have a match in the opposite file

One idea using join and sed (to remove the 2nd |):

join -t'|' -o 1.1,1.2,2.2 a.txt b.txt | sed -E 's/\|([^|]*$)/ \1/'

One awk idea:

awk '
BEGIN   { FS=OFS="|" }
FNR==NR { a[$1]=$2; next }
$1 in a { print $1,a[$1] " " $2 }
' a.txt b.txt

Both of these generate:

001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long
  • Related