PowerShell - Find duplicate files and ignore multiple files in the same compressed file

Time:05-13

I got this script and modified it a bit (to avoid extracting the same file into a single temp folder). I have two issues:

  1. When the script finds a duplicate, SourceArchive always shows one file (instead of the 2 archives that hold the same file inside).
  2. When a compressed file holds more than one copy of the same file in different subfolders (within the same zip), the script reports a duplicate, which is not what I want. If the compressed file has 3 files that are the same, they should be combined into 1 file, which is then compared against the other compressed files.

Update:

The main goal is to compare compressed files with each other in order to find duplicate files inside them. The compressed files can be .cab or .zip. A zip can contain dll, xml, msi and other files, and sometimes also .vip files (a .vip is itself a compressed file that contains files such as dlls). After comparing each compressed file with the others, the output should list the compressed files that hold the same files. It would be great to separate the results with ----------

This should be part of a bigger script that must stop if there are duplicate files in more than one compressed file: only if $MatchedSourceFiles has results should the script stop; otherwise it should continue. I hope it's clear now.

Example:
test1.zip contains temp.xml 
test2.zip contains temp.xml

The output should be:
SourceArchive       DuplicateFile
test1.zip           temp.xml
test2.zip           temp.xml
------------------------------
The next duplication files 
------------------------------

Example 2: (multiple identical files in the same compressed file)
test1.zip contains 2 subfolders
test1.zip contains temp.xml under subfolder1 and also temp.xml under subfolder2 

The result should be none
SourceArchive       DuplicateFile

Example 3:
test1.zip is the same as in Example 2
test3.zip contains temp.xml

The result should be:

SourceArchive       DuplicateFile
test1.zip           temp.xml
test3.zip           temp.xml
------------------------------
The next duplication files
------------------------------

Add-Type -AssemblyName System.IO.Compression.FileSystem

$tempFolder = Join-Path -Path ([IO.Path]::GetTempPath()) -ChildPath (New-GUID).Guid
$compressedfiles = Get-ChildItem -Path 'C:\Intel' -Include '*.zip', '*.CAB' -File -Recurse

$MatchedSourceFiles = foreach ($file in $compressedfiles) {
    switch ($file.Extension) {
        '.zip' {
            $t = $tempFolder + "\" + $file.Name
            # the destination folder should NOT already exist here
            $null = [System.IO.Compression.ZipFile]::ExtractToDirectory($file.FullName, $t )
            try {
                Get-ChildItem -Path $tempFolder -Filter '*.vip' -File -Recurse | ForEach-Object {
                    $null = [System.IO.Compression.ZipFile]::ExtractToDirectory($_.FullName, $t)
                }
            }
            catch {}
        }
        '.cab' {
            # the destination folder MUST exist for expanding .cab files
            $null = New-Item -Path $tempFolder -ItemType Directory -Force
            expand.exe $file.FullName -F:* $tempFolder > $null
        }
    }
    # now see if there are files with duplicate names
    Get-ChildItem -Path $tempFolder -File -Recurse -Exclude vip.manifest, filesSources.txt, *.vip | Group-Object Name | 
    Where-Object { $_.Count -gt 1 } | ForEach-Object { 
        foreach ($item in $_.Group) {
            # output objects to be collected in $MatchedSourceFiles
            [PsCustomObject]@{
                SourceArchive = $file.FullName
                DuplicateFile = '.{0}' -f $item.FullName.Substring($tempFolder.Length)  # relative path
            }
        }
    }
}

# display on screen
$MatchedSourceFiles
$tempFolder | Remove-Item -Force -Recurse

CodePudding user response:

Thanks for the examples. Using these, I changed my previous code to this:

Add-Type -AssemblyName System.IO.Compression.FileSystem

$tempFolder      = Join-Path -Path ([IO.Path]::GetTempPath()) -ChildPath (New-GUID).Guid
$compressedfiles = Get-ChildItem -Path 'C:\Intel' -Include '*.zip','*.CAB' -File -Recurse

$MatchedSourceFiles = foreach ($file in $compressedfiles) {
    switch ($file.Extension) {
        '.zip' {
            # the destination folder should NOT already exist here
            $null = [System.IO.Compression.ZipFile]::ExtractToDirectory($file.FullName, $tempFolder)
            # prepare a subfolder name for .vip files
            $subTemp = Join-Path -Path $tempFolder -ChildPath ([datetime]::Now.Ticks)
            Get-ChildItem -Path $tempFolder -Filter '*.vip' -File -Recurse | ForEach-Object {
                $null = [System.IO.Compression.ZipFile]::ExtractToDirectory($_.FullName, $subTemp)
            }
        }
        '.cab' {
            # the destination folder MUST exist for expanding .cab files
            $null = New-Item -Path $tempFolder -ItemType Directory -Force
            expand.exe $file.FullName -F:* $tempFolder > $null
        }
    }
    # output objects for each unique file name in the extracted folder to collect in $MatchedSourceFiles
    Get-ChildItem -Path $tempFolder -File -Recurse | 
        Select-Object @{Name = 'SourceArchive'; Expression = {$file.FullName}},
                      @{Name = 'FileName'; Expression = {$_.Name}} -Unique

    # delete the temporary folder
    $tempFolder | Remove-Item -Force -Recurse
}

# at this point $MatchedSourceFiles contains all (unique) filenames from all .zip and/or .cab files

# now see if there are files with duplicate names between the archive files
$result = $MatchedSourceFiles | Group-Object FileName | Where-Object { $_.Count -gt 1 } | ForEach-Object {$_.Group}

# display on screen
$result

# save as CSV file
$result | Export-Csv -Path 'X:\DuplicateFiles.csv' -UseCulture -NoTypeInformation
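
The question also asks that the surrounding script stop when duplicates exist. The code above only collects and exports them, so a guard along the following lines could be placed right after it. This is a minimal sketch, not part of the original answer: the function name `Assert-NoDuplicates` and the sample rows are hypothetical, and it assumes the records carry the `SourceArchive` and `FileName` properties produced above.

```powershell
# Hypothetical helper: throws (and so ends the calling script) when any
# duplicate records are passed in; does nothing for $null or an empty list.
function Assert-NoDuplicates {
    param([object[]]$Records)   # rows with SourceArchive and FileName properties

    if ($Records) {
        # collect the distinct file names that occur in more than one archive
        $names = @($Records.FileName | Sort-Object -Unique)
        throw "Stopping: $($names.Count) file name(s) occur in more than one archive: $($names -join ', ')"
    }
}

# Hypothetical stand-in for $result as it would look for Example 1
$sample = @(
    [PSCustomObject]@{ SourceArchive = 'C:\Intel\test1.zip'; FileName = 'temp.xml' }
    [PSCustomObject]@{ SourceArchive = 'C:\Intel\test2.zip'; FileName = 'temp.xml' }
)

# The try/catch is only here so the demonstration keeps running
try   { Assert-NoDuplicates -Records $sample }
catch { Write-Host $_.Exception.Message }
```

In the real script the call would be `Assert-NoDuplicates -Records $result` without the `try`/`catch`, so that the `throw` ends the script when `$result` is not empty and execution simply continues when it is. The listings that follow show `$result` for each of the question's examples.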

The output would be:

Example 1:

SourceArchive      FileName
-------------      --------
C:\Intel\test1.zip temp.xml
C:\Intel\test2.zip temp.xml

Example 2:

no output

Example 3:

SourceArchive      FileName
-------------      --------
C:\Intel\test1.zip temp.xml
C:\Intel\test3.zip temp.xml
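
To see why Example 2 yields nothing while Example 3 yields two rows, and to get the dashed separator the question asked for, the same logic can be tried on in-memory records without extracting any archive. This is only a sketch under stated assumptions: the function names `Get-CrossArchiveDuplicates` and `Format-DuplicateGroups` are hypothetical, as are the extra sample files `a.dll` and `only.msi`.

```powershell
# Hypothetical helper: collapse repeated names within one archive, then keep
# only the names that still occur more than once, i.e. in several archives.
function Get-CrossArchiveDuplicates {
    param([object[]]$Records)   # rows with SourceArchive and FileName properties
    if (-not $Records) { return }

    $Records |
        Select-Object SourceArchive, FileName -Unique |
        Group-Object FileName |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group }
}

# Hypothetical helper: one text line per row, and a dashed line after each
# group of rows that share a file name.
function Format-DuplicateGroups {
    param([object[]]$Records)
    if (-not $Records) { return }

    foreach ($group in ($Records | Group-Object FileName)) {
        foreach ($row in $group.Group) {
            '{0,-20}{1}' -f $row.SourceArchive, $row.FileName
        }
        '-' * 30
    }
}

# Sample data modelled on Example 3: test1.zip holds temp.xml in two subfolders
$extracted = @(
    [PSCustomObject]@{ SourceArchive = 'test1.zip'; FileName = 'temp.xml' }   # subfolder1
    [PSCustomObject]@{ SourceArchive = 'test1.zip'; FileName = 'temp.xml' }   # subfolder2
    [PSCustomObject]@{ SourceArchive = 'test3.zip'; FileName = 'temp.xml' }
    [PSCustomObject]@{ SourceArchive = 'test1.zip'; FileName = 'a.dll' }
    [PSCustomObject]@{ SourceArchive = 'test3.zip'; FileName = 'a.dll' }
    [PSCustomObject]@{ SourceArchive = 'test3.zip'; FileName = 'only.msi' }
)

$duplicates = Get-CrossArchiveDuplicates -Records $extracted
Format-DuplicateGroups -Records $duplicates
```

Because `Select-Object -Unique` runs before `Group-Object`, the two `temp.xml` rows from test1.zip count once, so a name is reported only when at least two different archives contain it. The order in which the groups are printed can differ between PowerShell versions, so the sketch makes no promise about which group comes first.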