Fastest, most efficient way to check for changes in a large XML file (100k entries) updated daily


I am working with XML files with 50-150k entries, about 50-100MB in size, that change daily. All entries are unique and have 10-15 elements each (id, title, etc.). I'm currently pulling the XML file into a string, parsing it with SimpleXML, and then looping through each entry to check for changes.

Here's the basic code...

$data = file_get_contents("test.xml");

// Replace characters that are invalid in XML 1.0 with a space
$data = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]/u', ' ', $data);

$xml = simplexml_load_string($data);

$item_count = count($xml->entry);

foreach (range(0, $item_count - 1, 1) as $num) {

    $id = (string)$xml->entry[$num]->id;
    $title = (string)$xml->entry[$num]->title;

    // ... etc. ...
}

I'm on a VPS server (1 CPU, 4GB RAM, 40GB disk). With no other code, just iterating over an XML file of about 70k entries (~80MB) takes 25-30 minutes. Performance degrades over time, with the CPU at 98% after a few minutes (RAM is fine): the first 30k entries take 10 minutes, while the next 40k take about 20 minutes.
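One likely contributor to the progressive slowdown (an assumption, not confirmed by the source): SimpleXML's indexed access `$xml->entry[$num]` walks the sibling list from the start on each lookup, which makes the loop effectively quadratic. Iterating with `foreach` visits each entry exactly once. A minimal sketch with made-up sample data:

```php
<?php
// Sketch: foreach over SimpleXML visits each <entry> once, avoiding the
// repeated sibling walk that $xml->entry[$num] performs on every lookup.
// The XML below is a hypothetical stand-in for the real feed.
$data = <<<XML
<feed>
  <entry><id>1</id><title>First</title></entry>
  <entry><id>2</id><title>Second</title></entry>
  <entry><id>3</id><title>Third</title></entry>
</feed>
XML;

$xml = simplexml_load_string($data);

$titles = [];
foreach ($xml->entry as $entry) {   // sequential traversal, no re-indexing
    $titles[(string) $entry->id] = (string) $entry->title;
}

print_r($titles);
```

This alone won't reach XMLReader speeds on an 80MB file, but it removes the per-index scan from the inner loop.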

Is there a faster, more efficient way to do this than SimpleXML? Or is there a better method for checking large XML files that change daily (e.g. a MySQL import and queries)?

I've seen suggestions to use a SAX parser, but I like the easy access to elements that SimpleXML provides. If I can stream the XML and save RAM, that's preferable as well.

One last note, if it helps, my current logic for checking for changes is as follows:

  1. Create an array of previously imported entries, with the entry ID as the key and a string of the entry values to check as the value
  2. Loop through the XML file and check whether each entry's id/value pair already exists:

// Check if the current entry is in the array
if ($previously_imported_entries[$id] === $value1 . " --- " . $value2 . " --- " . $value3) {

   // Item is in array, and values are the same
   // No changes
   // etc...   
} else {
   
   // Item isn't in the array or values have changed
   // Import entry either way  

}
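The two steps above can be sketched as a runnable snippet. Here an md5 fingerprint stands in for the concatenated value string (a compact alternative, not the poster's exact code), and all ids, field names, and values are made-up placeholders:

```php
<?php
// Sketch of the change-detection logic: previously imported entries keyed by
// id, mapped to a fingerprint of the fields we care about. All data here is
// hypothetical.
$previously_imported_entries = [
    '101' => md5('Old title|Old desc'),
    '102' => md5('Same title|Same desc'),
];

// Current entries as parsed from the XML feed (also placeholders).
$current = [
    '101' => ['title' => 'New title',  'desc' => 'Old desc'],   // changed
    '102' => ['title' => 'Same title', 'desc' => 'Same desc'],  // unchanged
    '103' => ['title' => 'Brand new',  'desc' => 'New entry'],  // new
];

$to_import = [];
foreach ($current as $id => $fields) {
    $fingerprint = md5($fields['title'] . '|' . $fields['desc']);
    if (($previously_imported_entries[$id] ?? null) === $fingerprint) {
        continue; // item is in the array and values are the same: no changes
    }
    $to_import[] = $id; // new or changed: import either way
}

print_r($to_import); // imports ids 101 and 103 under these sample values
```

Hashing keeps the lookup array small in RAM, at the cost of not knowing *which* field changed.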

CodePudding user response:

XMLReader + SimpleXML.

Incredible!! Same concept, same files, new result time: 6-10 seconds.

Credit: Bartosz Pachołek, LinkedIn post "Parsing huge XML files with PHP"

*Thanks also to Stack Overflow user Maharramoff for pointing me to XMLReader

Here is Bartosz's example code:

<?php
//include "memcheck.php";
$start = time();

$xml = XMLReader::open('random5.xml');
// go to the first 'object' element (stops if the document ends first)
while ($xml->read() && $xml->name !== 'object');

do {
    $object = simplexml_load_string($xml->readOuterXml());
    $id = (string) $object->id;
    $name = (string) $object->name;
    $features = [];
    foreach($object->features->feature as $feature) {
        $features[(string)$feature->id] = (string) $feature->name;
    }

    $services = [];
    foreach($object->services->service as $service) {
        $services[(string)$service->id] = (string) $service->service;
    }

    //here again we have all data of an object
} while ($xml->next('object'));

//var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

This worked fantastic for me.

Just replaced 'object' in the while and next() calls with 'entry' to loop through all of the entry tags in my XML files, and updated the specific element names/variables.

*I commented out the memcheck.php include and the second-to-last line, as I didn't use them.

NOTE: For the xml files I used, I first had to remove all of the invalid characters to avoid PCDATA invalid Char errors. If it helps anyone, here's a script that I used:

$data = file_get_contents('test.xml');

// Replace characters that are invalid in XML 1.0 with a space
$data = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]/u', ' ', $data);

file_put_contents('file.xml', $data);

Bartosz's article lays out several tests with 4GB and other XML files, comparing the speeds of SimpleXML, DOM, the SAX Expat parser, XMLReader, etc. Very helpful. Check it out if you're interested in his other findings.

For the speed & efficiency I needed, plus the simplicity I enjoyed with SimpleXML, this solution is best.

Test and see if it works for you.

CodePudding user response:

The XMLReader class might solve the problem

"If you have to deal with massive XML documents, use XMLReader to process them. Don't try and gather an entire XML document into a PHP data structure using XMLReader and a PHP xml2assoc() function, you are reinventing the SimpleXML wheel. When parsing massive XML documents using XMLReader, gather the data you need to perform an operation then perform it before skipping to the next node."

You can read more here
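The "gather, process, skip" pattern described in the quote can be sketched as below, using `XMLReader::expand()` with `simplexml_import_dom()` as an alternative to `readOuterXml()` (file contents and element names are made up for illustration):

```php
<?php
// Minimal sketch: stream with XMLReader, expand one <entry> at a time into
// SimpleXML, process it, then skip to the next sibling. Hypothetical data.
$tmp = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($tmp,
    '<feed><entry><id>1</id><title>A</title></entry>'
    . '<entry><id>2</id><title>B</title></entry></feed>');

$reader = new XMLReader();
$reader->open($tmp);

// advance to the first <entry>
while ($reader->read() && $reader->name !== 'entry');

$doc  = new DOMDocument();
$seen = [];
do {
    // gather just this entry's subtree as SimpleXML...
    $entry = simplexml_import_dom($doc->importNode($reader->expand(), true));
    // ...process it...
    $seen[(string) $entry->id] = (string) $entry->title;
    // ...then jump to the next <entry>, skipping the subtree just handled
} while ($reader->next('entry'));

$reader->close();
unlink($tmp);

print_r($seen); // keys 1 and 2 mapped to titles A and B
```

Only one entry's subtree is ever materialized at a time, so memory stays flat regardless of file size.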
