Unix command to search file1 id in another file2 and write result to a file3

I need to read ids from one file and look each of them up in a second XML file; when an id is found, the entire matching line should be written to a third file. file1 is 111 MB and file2 is 40 GB.

File1.xml

id1
id2
id5

File2.xml

<employees>
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id3</id><name>test3</name></employee>
<employee><id>id4</id><name>test4</name></employee>
<employee><id>id5</id><name>test5</name></employee>
<employee><id>id6</id><name>test6</name></employee>
</employees>

File3.xml : result

<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id5</id><name>test5</name></employee>

I tried it with grep:

grep -i -f file1.xml file2.xml >> file3.xml

but it fails with a "memory exhausted" error.

I also tried a loop that runs awk once per id:

while read -r id; do
  awk -v pat="$id" '$0 ~ pat' file2.xml >> file3.xml
done < file1.xml

This takes far too long, because file2.xml is scanned again for every id. What would be the most efficient solution for this?

CodePudding user response:

With your shown samples, please try the following awk code. It was written and tested in GNU awk (FPAT is a GNU awk extension).

awk -v FPAT='<id>[^<]*</id>' '
FNR==NR{
  arr["<id>"$0"</id>"]
  next
}
($1 in arr)
' file1.xml file2.xml

Explanation: a detailed, line-by-line explanation of the code above.

awk -v FPAT='<id>[^<]*</id>' '   ##Start the awk program and set FPAT so that each field is an <id>...</id> element.
FNR==NR{                         ##This condition is TRUE only while file1.xml is being read.
  arr["<id>"$0"</id>"]           ##Create an entry in array arr, indexed by the id wrapped in <id> and </id>.
  next                           ##Skip the remaining statements and move on to the next line.
}
($1 in arr)                      ##If $1 (the <id> element of the line) is present in arr, print that line.
' file1.xml file2.xml            ##Input file names.
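
The FNR==NR test is what separates the two passes: NR counts lines across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. A minimal sketch of that idiom follows; first.txt, second.txt and phases.txt are made-up names used only for this illustration, and the sketch is best run in a scratch directory.

```shell
# Toy inputs, invented for this illustration only.
printf '%s\n' a b > first.txt
printf '%s\n' x y z > second.txt

# NR counts lines across both files; FNR restarts for each file,
# so FNR==NR holds only while first.txt is being read.
awk 'FNR==NR { print "first file:  " $0; next }
             { print "second file: " $0 }' first.txt second.txt > phases.txt

# Show which branch handled each line.
cat phases.txt
```

The first branch ends with next, so lines of the first file never reach the second rule. This is the same mechanism the answer uses to load the ids before filtering file2.xml.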

CodePudding user response:

This should work in any awk version:

awk 'FNR == NR {
   seen["<id>" $1 "</id>"]
   next
}
match($0, /<id>[^<]*<\/id>/) && substr($0, RSTART, RLENGTH) in seen
' file1 file2

Output for the sample files:

<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id5</id><name>test5</name></employee>
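
If grep is still preferred, one possible variant of the original attempt is to search for fixed strings instead of regular expressions. Without -F, grep treats every line of file1.xml as a regular expression, which is a likely contributor to the memory error. The sketch below wraps each id in its tags so that id1 cannot match id10, then passes the wrapped ids to grep -F. It assumes a grep that accepts "-f -" to read patterns from standard input (GNU grep does), drops -i on the assumption that ids are case-sensitive, and recreates the question's sample files, so it should be run in a scratch directory. Whether memory use is acceptable with a 111 MB id list is not guaranteed and would need to be measured.

```shell
# Sample data copied from the question; the real files are far larger.
printf '%s\n' id1 id2 id5 > file1.xml
cat > file2.xml <<'EOF'
<employees>
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id3</id><name>test3</name></employee>
<employee><id>id4</id><name>test4</name></employee>
<employee><id>id5</id><name>test5</name></employee>
<employee><id>id6</id><name>test6</name></employee>
</employees>
EOF

# Turn each id into <id>id</id>, then search for those strings literally.
# -F: patterns are fixed strings; "-f -": read the patterns from stdin.
sed 's|.*|<id>&</id>|' file1.xml | grep -F -f - file2.xml > file3.xml

# Show the matching lines.
cat file3.xml
```

Like the awk answers, this reads file2.xml only once, and the matching lines appear in file2.xml order.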