Timeout while parsing CSV file


I have a .csv file that is about 5 MB (~45,000 rows). What I need to do is run through each row of the file and check whether the ID in that row is already in a table in my database. If it is, I can delete that row from the file.

I did a good amount of research on the most memory-efficient way to do this, so I've been using the method of writing lines that don't need to be deleted to a temporary file and then renaming that file to the original. Code below:

$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while(($row = fgetcsv($file)) !== false){
    // id is the 7th value in the row
    $id = $row[6];
    
    // check table to see if id exists
    $sql = "SELECT id FROM table WHERE id = $id";
    $result = mysqli_query($conn, $sql);
    
    // if id is in the database, skip to next row
    if(mysqli_num_rows($result) > 0){
        continue;
    }
    
    // else write line to temp file
    fputcsv($temp, $row);
}

fclose($file);
fclose($temp);

// overwrite original file
rename($tempFilename, $filename);

Problem is, I'm running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?

CodePudding user response:

You fire a database query per line, i.e. ~45,000 queries... that takes too much time.

Better: run a single query before the loop and read the existing ids into a lookup array, then only check that array inside the loop.

Pseudo code:

$st = query('SELECT id FROM table');
while ($row = $st->fetch()) {
    $lookup[ $row['id'] ] = $row['id'];
}

// now read CSV
while($row = fgetcsv($h)) {
    $id = $row[6];

    if (isset($lookup[ $id ])) {
        // id exists, skip
        continue;
    }

    // write the non-existing row to a different file...
}
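
For reference, a runnable version of the same idea with mysqli, reusing $conn, $filename, and $tempFilename from the question (a sketch; "table" and the id column index are the question's placeholders):

// build a lookup set: array keys give O(1) membership tests
$lookup = [];
$result = mysqli_query($conn, "SELECT id FROM table");
while ($row = mysqli_fetch_assoc($result)) {
    $lookup[$row['id']] = true;
}
mysqli_free_result($result);

$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while (($row = fgetcsv($file)) !== false) {
    // skip rows whose id already exists in the database
    if (isset($lookup[$row[6]])) {
        continue;
    }
    fputcsv($temp, $row);
}

fclose($file);
fclose($temp);
rename($tempFilename, $filename);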

edit: Assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?

Collect the ids from the CSV into an array. Run a single query to find all of those ids in the database and collect the results (there can be at most as many as in the CSV). Now array_diff() the ids from the file with the ids from the database - the ids that remain exist in the CSV but not in the database.

Pseudo code:

$ids_csv = [];
while($row = fgetcsv($h)) {
    $id = $row[6];
    $ids_csv[] = intval($id);
}

$sql = sprintf('SELECT id FROM table WHERE id IN(%s)', implode(',', $ids_csv));

$ids_db = [];
$st = query($sql);
while ($row = $st->fetch()) {
    $ids_db[] = $row['id'];
}

$missing_in_db = array_diff($ids_csv, $ids_db);
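
One practical caveat (not in the pseudo code above): with tens of thousands of ids, a single IN() list gets very long and can hit query-size limits, so the lookup can be done in chunks. A sketch using the same query()/fetch() placeholders as above; the chunk size of 1000 is arbitrary:

$ids_db = [];
foreach (array_chunk($ids_csv, 1000) as $chunk) {
    // ids were run through intval() above, so the implode is safe
    $sql = sprintf('SELECT id FROM table WHERE id IN(%s)', implode(',', $chunk));
    $st = query($sql);
    while ($row = $st->fetch()) {
        $ids_db[] = intval($row['id']);
    }
}

$missing_in_db = array_diff($ids_csv, $ids_db);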

CodePudding user response:

  1. I would use LOAD DATA INFILE: https://dev.mysql.com/doc/refman/8.0/en/load-data.html
    Your database user needs the FILE privilege to use it. Use it to read the CSV file into a separate table.
  2. Then you can run one query to delete the ids that already exist (DELETE ... JOIN ...).
  3. And export the rows that were left intact.

Another option is to use your loop to insert the CSV file into a separate table, and then proceed with step 2.
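
A minimal sketch of that flow, assuming a staging table named csv_import whose columns match the CSV layout, and keeping the question's placeholder table name (backticked, since TABLE is a reserved word). The file paths refer to the database server's filesystem:

// 1) load the CSV into the staging table (requires the FILE privilege;
//    in the question the id is the 7th column of the file)
mysqli_query($conn, "LOAD DATA INFILE '/path/on/server/file.csv'
    INTO TABLE csv_import
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'");

// 2) delete staged rows whose id already exists in the main table
mysqli_query($conn, "DELETE csv_import FROM csv_import
    JOIN `table` ON `table`.id = csv_import.id");

// 3) export the remaining rows; INTO OUTFILE writes on the server
//    and fails if the target file already exists
mysqli_query($conn, "SELECT * FROM csv_import
    INTO OUTFILE '/path/on/server/filtered.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'");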

CodePudding user response:

A little bit of everything, trying to work on as large a part of the data set as possible at once. You didn't state your memory constraints, so I'm not sure whether this approach fits your case. Also, is this a one-time job, or are you running it regularly? If the latter, you might have to safeguard the data so that it doesn't change while you're processing it.

$sql = "SELECT id FROM table";
$result = $mysqli->query($sql);

// Fetch all existing ids from database
$id_lookup = [];
foreach($result->fetch_all(MYSQLI_ASSOC) as $row) {
    $id_lookup[$row['id']] = true;
}

$result->free_result();

// Read all lines from file
foreach(file($filename) as $line) {
    $csv = str_getcsv($line);

    if (array_key_exists($csv[6], $id_lookup) {
        // id exists
    }

    // id does not exist in db
}
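
As in the question, the kept rows can then be written out with fputcsv() to a temporary file and swapped in with rename().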