Home > Software design >  How to merge 2 x CSVs with the same column but overwrite not append?
How to merge 2 x CSVs with the same column but overwrite not append?

Time:08-24

I've got this one that has been baffling me all day, and I can't seem to find any search results that match exactly what I am trying to do.

I have 2 CSV files, both of which have the same columns and headers. They look like this (shortened for the purpose of this post):

"plate","labid","well"
"1013740016604537004556","none46","F006"
"1013740016604537004556","none47","G006"
"1013740016604537004556","none48","H006"
"1013740016604537004556","3835265","A007"
"1013740016604537004556","3835269","B007"
"1013740016604537004556","3835271","C007"

Each of the 2 CSVs only have some actual Lab IDs, and the 'nonexx' are just fillers for the importing software. There is no duplication ie each 'well' is only referenced once across the 2 files.

What I need to do is merge the 2 CSVs, for example the second CSV might have a Lab ID for well H006 but the first will not. I need the lab ID from the second CSV imported into the first, overwriting the 'nonexx' currently in that column.

Here is my current code:


            $CSVB = Import-CSV "$RootDir\SymphonyOutputPending\$plateID`A_Header.csv"
            Import-CSV "$RootDir\SymphonyOutputPending\$plateID`_Header.csv" | ForEach-Object {
                $CSVData = [PSCustomObject]@{
                    labid = $_.labid
                    well = $_.well
                }
                
                If ($CSVB.well -match $CSVData.wellID) {
                    write-host "I MATCH"
                    ($CSVB | Where-Object {$_.well -eq $CSVData.well}).labid = $CSVData.labid
            }
            $CSVB | Export-CSV "$RootDir\SymphonyOutputPending\$plateID`_final.csv" -NoTypeInformation
        }

The code runs but doesn't 'merge' the data, the final CSV output is just a replication of the first input file. I am definitely getting a match as the string "I MATCH" appears several times when debugging as expected.

CodePudding user response:

Based on the responses in the comments of your question, I believe this is what you are looking for. This assumes that the both CSVs contain the exact same data with labid being the only difference.

There is no need to modify csv2 if we are just grabbing the labid to overwrite the row in csv1.

$csv1 = Import-Csv C:\temp\LabCSV1.csv
$csv2 = Import-Csv C:\temp\LabCSV2.csv

# Loop through csv1 rows
Foreach($line in $csv1) {
    # If Labid contains "none"
    If($line.labid -like "none*") {
        # Set rows labid to the labid from csv2 row that matches plate/well
        # May be able to remove the plate section if well is a unique value
        $line.labid = ($csv2 | Where {$_.well -eq $line.well -and $_.plate -eq $line.plate}).labid
    }
}
# Export to CSV - not overwrite - to confirm results
$csv1 | export-csv C:\Temp\LabCSV1Adjusted.csv -NoTypeInformation

CodePudding user response:

Since you need to do a bi-directional comparison of the 2 Csvs you could create a new array of both and then group the objects by their well property, for this you can use Group-Object, then filter each group if their Count is equal to 2 where their labid property does not start with none else return the object as-is.

Using the following Csvs for demonstration purposes:

  • Csv1
"plate","labid","well"
"1013740016604537004556","none46","F006"
"1013740016604537004556","none47","G006"
"1013740016604537004556","3835265","A007"
"newrowuniquecsv1","none123","X001"
  • Csv2
"plate","labid","well"
"1013740016604537004556","none48","A007"
"1013740016604537004556","3835269","F006"
"1013740016604537004556","3835271","G006"
"newrowuniquecsv2","none123","X002"

Code

Note that this code assumes there will be a maximum of 2 objects with the same well property and, if there are 2 objects with the same well, one of them must have a value not starting with none.

$mergedCsv = @(
    Import-Csv pathtocsv1.csv
    Import-Csv pathtocsv2.csv
)

$mergedCsv | Group-Object well | ForEach-Object {
    if($_.Count -eq 2) {
        return $_.Group.Where{ -not $_.labid.StartsWith('none') }
    }

    $_.Group
} | Export-Csv pathtomerged.csv -NoTypeInformation

Output

plate                  labid   well
-----                  -----   ----
1013740016604537004556 3835265 A007
1013740016604537004556 3835269 F006
1013740016604537004556 3835271 G006
newrowuniquecsv1       none123 X001
newrowuniquecsv2       none123 X002

CodePudding user response:

If the lists are large, performance might be an issue as Where-Object (or any other where method) and Group-Object do not perform very well for embedded loops.

By indexing the second csv file (aka creating a hashtable), you have quicker access to the required objects. Indexing upon two (or more) items (plate and well) is issued here: Does there exist a designated (sub)index delimiter? and resolved by @mklement0 and zett42 with a nice CaseInsensitiveArrayEqualityComparer class.

To apply this class on Drew's helpful answer:

$csv1 = Import-Csv C:\temp\LabCSV1.csv
$csv2 = Import-Csv C:\temp\LabCSV2.csv

$dict = [hashtable]::new([CaseInsensitiveArrayEqualityComparer]::new())
$csv2.ForEach{ $dict.($_.plate, $_.well) = $_ }

Foreach($line in $csv1) {
    If($line.labid -like "none*") {
        $line.labid = $dict.($line.plate, $line.well).labid
    }
}
$csv1 | export-csv C:\Temp\LabCSV1Adjusted.csv -NoTypeInformation
  • Related