I need to read ids from one file and search for them in a second XML file; whenever an id is found, the entire matching line should be written to a third file. file1 is 111 MB and file2 is 40 GB.
File1.xml
id1
id2
id5
File2.xml
<employees>
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id3</id><name>test3</name></employee>
<employee><id>id4</id><name>test4</name></employee>
<employee><id>id5</id><name>test5</name></employee>
<employee><id>id6</id><name>test6</name></employee>
</employees>
File3.xml : result
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id5</id><name>test5</name></employee>
I tried it with grep:
grep -i -f file1.xml file2.xml >> file3.xml
but it fails with a "memory exhausted" error.
Another approach I tried was a loop with awk:
while read -r id;do
awk -v pat="$id" '$0~pat' file2.xml >> file3.xml
done < file1.xml
but it takes far too long. What would be the best solution for this?
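A note on the grep attempt: with -f, grep treats every line of the pattern file as a regular expression, and compiling 111 MB of them is a plausible cause of the memory error. If the ids are literal strings, -F (fixed strings) avoids the regex compilation, and -w (supported by GNU and BSD grep) keeps id1 from matching id10. This is only a sketch on small stand-in files (the demo_* names are placeholders). It has not been measured at 40 GB scale, fixed-string matching may still need a lot of memory for a pattern list this large, and it matches an id anywhere on the line, not only inside <id>...</id>.

```shell
# Small stand-ins for the real files (demo_* names are placeholders).
printf '%s\n' id1 id2 id5 > demo_ids.txt
cat > demo_employees.xml <<'EOF'
<employees>
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id3</id><name>test3</name></employee>
<employee><id>id4</id><name>test4</name></employee>
<employee><id>id5</id><name>test5</name></employee>
<employee><id>id6</id><name>test6</name></employee>
</employees>
EOF

# -F: treat each pattern as a fixed string, not a regular expression
# -w: match whole words only, so id1 does not match id10
# -f: read the patterns from a file, one per line
grep -F -w -f demo_ids.txt demo_employees.xml > demo_grep_result.xml
cat demo_grep_result.xml
```

The -i flag from the original command is dropped here on the assumption that ids are case-sensitive; add it back if they are not.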
CodePudding user response:
With your shown samples, please try the following awk code. Written and tested in GNU awk (FPAT is a GNU awk extension, so this will not work in other awk implementations).
awk -v FPAT='<id>[^<]*</id>' '
FNR==NR{
arr["<id>"$0"</id>"]
next
}
($1 in arr)
' file1.xml file2.xml
Explanation: a detailed explanation of the above.
awk -v FPAT='<id>[^<]*</id>' ' ##Start the awk program; FPAT makes each field a match of <id>[^<]*</id>.
FNR==NR{ ##This condition is TRUE only while file1.xml is being read.
arr["<id>"$0"</id>"] ##Create an entry in array arr indexed by the id wrapped in <id> and </id>.
next ##Skip all further statements for this line.
}
($1 in arr) ##For file2.xml: if $1 (the first <id>...</id> match) is present in arr, print that line.
' file1.xml file2.xml ##Input file names.
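Where GNU awk is not available, a similar field-based lookup can be sketched with a regular-expression field separator, which POSIX awk supports. Assuming each line carries at most one <id>...</id> pair as in the samples, splitting on <id> or </id> leaves the bare id in $2, so the array can store the ids exactly as they appear in the id file, without the wrapping tags. The demo_* file names are placeholders.

```shell
# Small stand-ins for the real files (demo_* names are placeholders).
printf '%s\n' id1 id2 id5 > demo_ids.txt
cat > demo_employees.xml <<'EOF'
<employees>
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id3</id><name>test3</name></employee>
<employee><id>id4</id><name>test4</name></employee>
<employee><id>id5</id><name>test5</name></employee>
<employee><id>id6</id><name>test6</name></employee>
</employees>
EOF

# FS is the regex </?id>, so fields are split at <id> and </id>.
awk -F'</?id>' '
FNR == NR {                  # first file: one bare id per line
    if ($0 != "") seen[$0]   # remember the id, skipping blank lines
    next
}
NF > 1 && ($2 in seen)       # second file: $2 is the text between <id> and </id>
' demo_ids.txt demo_employees.xml > demo_fs_result.xml
cat demo_fs_result.xml
```

Both files are read once, and only the ids are held in memory, which is the property that makes this kind of lookup practical for a large second file.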
CodePudding user response:
This should work in any awk version:
awk 'FNR == NR {
seen["<id>" $1 "</id>"]
next
}
match($0, /<id>[^<]*<\/id>/) && substr($0, RSTART, RLENGTH) in seen
' file1 file2
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id5</id><name>test5</name></employee>
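The question asks for the matches in a third file, so the program's output can simply be redirected. One pitfall worth guarding against, purely as an assumption about how the files were produced: if the id file has Windows (CRLF) line endings, the trailing carriage return becomes part of $1 and nothing matches. A minimal sketch on stand-in files (the demo_* names are placeholders) that strips it first:

```shell
# Stand-in id file with CRLF line endings, plus a small XML sample.
printf 'id1\r\nid2\r\nid5\r\n' > demo_ids_crlf.txt
cat > demo_employees.xml <<'EOF'
<employees>
<employee><id>id1</id><name>test1</name></employee>
<employee><id>id2</id><name>test2</name></employee>
<employee><id>id3</id><name>test3</name></employee>
<employee><id>id4</id><name>test4</name></employee>
<employee><id>id5</id><name>test5</name></employee>
<employee><id>id6</id><name>test6</name></employee>
</employees>
EOF

awk 'FNR == NR {
    sub(/\r$/, "")                    # drop a trailing carriage return, if any
    if (NF) seen["<id>" $1 "</id>"]   # skip blank lines
    next
}
match($0, /<id>[^<]*<\/id>/) && (substr($0, RSTART, RLENGTH) in seen)
' demo_ids_crlf.txt demo_employees.xml > demo_result.xml
cat demo_result.xml
```

Using > rather than >> means a rerun replaces the result file instead of appending duplicate lines to it.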