Fast compare two large csv(boths rows and columns) in powershell-CodePudding

I have two large CSVs to compare. Bosth csvs are basically data from the same system 1 day apart. No of rows are around 12k and columns 30.

The aim is to identify what column data has changed for primary key(#ID).

My idea was to loop through the CSVs to identify which rows have changed and dump these into a separate csvs. One done, I again loop through the changes rows, and indetify the exact change in column.

    NewCSV = Import-Csv -Path ".\Data_A.csv"
    OldCSV = Import-Csv -Path ".\Data_B.csv"
     
   foreach ($LineNew in $NewCSV)
    {
        ForEach ($LineOld in $OldCSV)
        {
            If($LineNew -eq $LineOld)
            {
                Write-Host $LineNew, " Match"
            }else{
                Write-Host $LineNew, " Not Match"
            }
        }
    }

But as soon as run the loop, it takes forever to run for 12k rows. I was hoping there must be a more efficient way to compare large files powershell. Something that is quicker.

CodePudding user response：

Well you can give this a try, I'm not claiming it will be fast for what vonPryz has already pointed out but it should give you a good side-by-side perspective to compare what has changed from OldCsv to NewCsv.

Note: Those cells that have the same value on both CSVs will be ignored.

$NewCSV = Import-Csv -Path ".\Data_A.csv"
$OldCSV = Import-Csv -Path ".\Data_B.csv" | Group-Object ID -AsHashTable -AsString

$properties = $newCsv[0].PSObject.Properties.Name

$result = foreach($line in $NewCSV)
{
    if($ref = $OldCSV[$line.ID])
    {
        foreach($prop in $properties)
        {
            if($line.$prop -ne $ref.$prop)
            {
                [pscustomobject]@{
                    ID = $line.ID
                    Property = $prop
                    OldValue = $ref.$prop
                    NewValue = $line.$prop
                }
            }
        }
        continue
    }

    Write-Warning "ID $($line.ID) could not be found on Old Csv!!"
}

CodePudding user response：

As vonPryz hints in the comments, you've written an algorithm with quadratic time complexity (O(n²) in Big-O notation) - every time the input size doubles, the number of computations performed increase 4-fold.

To avoid this, I'd suggest using a hashtable or other dictionary type to hold each data set, and use the primary key from the input as the dictionary key. This way you get constant-time lookup of corresponding records, and the time complexity of your algorithm becomes near-linear (O(2n k)):

$NewCSV = @{}
Import-Csv -Path ".\Data_A.csv" |ForEach-Object {
  $NewCSV[$_.ID] = $_
}

$OldCSV = @{}
Import-Csv -Path ".\Data_B.csv" |ForEach-Object {
  $OldCSV[$_.ID] = $_
}

Now that we can efficiently resolve each row by it's ID, we can inspect the whole of the data sets with an independent loop over each:

foreach($entry in $NewCSV.GetEnumerator()){
  if(-not $OldCSV.ContainsKey($entry.Key)){
    # $entry.Value is a new row, not seen in the old data set
  }

  $newRow = $entry.Value
  $oldRow = $OldCSV[$entry.Key]

  # do the individual comparison of the rows here
}

Do another loop like above, but with $NewCSV in place of $OldCSV to find/detect deletions.