I need to be able to identify some large binary files which have been copied and renamed between secure servers. To do this, I would like to be able to hash the first X bytes and the last X bytes of all the files. I need to do this with only what is available on a standard Windows 10 system with no additional software installed, so PowerShell seems like the right choice.
Some things that don't work:
- I cannot read the entire file in, then extract the parts of the file I want to hash. The objective I'm trying to achieve is to minimize the amount of the file I need to read, and reading the entire file defeats that purpose.
- Reading moderately large portions of a file into a PowerShell variable appears to be pretty slow, so $hash.ComputeHash($moderatelyLargeVariable) doesn't seem like a viable solution.
I'm pretty sure I need to do $hash.ComputeHash($stream), where $stream only streams part of the file.
Thus far I've tried:
function Get-FileStreamHash {
param (
$FilePath,
$Algorithm
)
$hash = [Security.Cryptography.HashAlgorithm]::Create($Algorithm)
## METHOD 0: See description below
$stream = ([IO.StreamReader]"${FilePath}").BaseStream
$hashValue = $hash.ComputeHash($stream)
## END of part I need help with
# Convert to a hexadecimal string
$hexHashValue = -join ($hashValue | ForEach-Object { "{0:x2}" -f $_ })
$stream.Close()
# return
$hexHashValue
}
Method 0: This works, but it's streaming the whole file and thus doesn't solve my problem. For a 3GB file this takes about 7 seconds on my machine.
Method 1: $hashValue = $hash.ComputeHash((Get-Content -Path $FilePath -Stream "")). This is also streaming the whole file, and it also takes forever. For the same 3GB file it takes something longer than 5 minutes (I cancelled at that point, and don't know what the total duration would be).
Method 2: $hashValue = $hash.ComputeHash((Get-Content -Path $FilePath -Encoding byte -TotalCount $qtyBytes -Stream "")). This is the same as Method 1, except that it limits the content to $qtyBytes. At 1000000 (1MB) it takes 18 seconds. I think that means Method 1 would have taken ~15 hours, 7700x slower than Method 0.
Is there a way to do something like Method 2 (limit what is read) but without the slow down? And if so, is there a good way to do it on just the end of the file?
Thanks!
CodePudding user response:
You could try one (or a combination of both) of the following helper functions, which read a number of bytes from the beginning or from the end of the file:
function Read-FirstBytes {
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true, Position = 0)]
[Alias('FullName', 'FilePath')]
[ValidateScript({ Test-Path -Path $_ -PathType Leaf })]
[string]$Path,
[Parameter(Mandatory=$true, Position = 1)]
[int]$Bytes,
[ValidateSet('ByteArray', 'HexString', 'Base64')]
[string]$As = 'ByteArray'
)
try {
$stream = [System.IO.File]::OpenRead($Path)
$length = [math]::Min([math]::Abs($Bytes), $stream.Length)
$buffer = [byte[]]::new($length)
$null = $stream.Read($buffer, 0, $length)
switch ($As) {
'HexString' { ($buffer | ForEach-Object { "{0:x2}" -f $_ }) -join '' ; break }
'Base64' { [Convert]::ToBase64String($buffer) ; break }
default { ,$buffer }
}
}
catch { throw }
finally { $stream.Dispose() }
}
function Read-LastBytes {
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true, Position = 0)]
[Alias('FullName', 'FilePath')]
[ValidateScript({ Test-Path -Path $_ -PathType Leaf })]
[string]$Path,
[Parameter(Mandatory=$true, Position = 1)]
[int]$Bytes,
[ValidateSet('ByteArray', 'HexString', 'Base64')]
[string]$As = 'ByteArray'
)
try {
$stream = [System.IO.File]::OpenRead($Path)
$length = [math]::Min([math]::Abs($Bytes), $stream.Length)
$null = $stream.Seek(-$length, 'End')
$buffer = for ($i = 0; $i -lt $length; $i++) { $stream.ReadByte() }
switch ($As) {
'HexString' { ($buffer | ForEach-Object { "{0:x2}" -f $_ }) -join '' ; break }
'Base64' { [Convert]::ToBase64String($buffer) ; break }
default { ,[Byte[]]$buffer }
}
}
catch { throw }
finally { $stream.Dispose() }
}
You can then compute a hash value from the returned bytes and format it as you like.
Combining both functions is also possible, for example:
$begin = Read-FirstBytes -Path 'D:\Test\somefile.dat' -Bytes 50 # take the first 50 bytes
$end = Read-LastBytes -Path 'D:\Test\somefile.dat' -Bytes 1000 # and the last 1000 bytes
$Algorithm = 'MD5'
$hash = [Security.Cryptography.HashAlgorithm]::Create($Algorithm)
$hashValue = $hash.ComputeHash($begin + $end)
($hashValue | ForEach-Object { "{0:x2}" -f $_ }) -join ''
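The two reads can also be folded into a single function that opens the file once and hashes only the head and tail bytes. The following is a minimal sketch, not a definitive implementation: the name Get-PartialFileHash and its parameters are hypothetical, it assumes a full path is passed (the .NET file APIs do not follow the PowerShell location), and it assumes that shrinking the two ranges on small files so they never overlap is acceptable.

```powershell
# Sketch only: Get-PartialFileHash and its parameter names are hypothetical.
function Get-PartialFileHash {
    param (
        [Parameter(Mandatory = $true)][string]$Path,   # full path to the file
        [Parameter(Mandatory = $true)][int]$Bytes,     # X: bytes taken from each end
        [string]$Algorithm = 'SHA256'
    )
    $hash   = [Security.Cryptography.HashAlgorithm]::Create($Algorithm)
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        # Clamp both ranges so they never overlap on files shorter than 2 * $Bytes
        $headCount = [int][math]::Min([long]$Bytes, $stream.Length)
        $tailCount = [int][math]::Min([long]$Bytes, $stream.Length - $headCount)
        $buffer    = [byte[]]::new($headCount + $tailCount)

        # Stream.Read may return fewer bytes than requested, so loop until the head is filled
        $offset = 0
        while ($offset -lt $headCount) {
            $read = $stream.Read($buffer, $offset, $headCount - $offset)
            if ($read -eq 0) { break }
            $offset += $read
        }

        # Jump to the tail without reading anything in between, then fill the rest
        $null = $stream.Seek(-$tailCount, [System.IO.SeekOrigin]::End)
        while ($offset -lt $buffer.Length) {
            $read = $stream.Read($buffer, $offset, $buffer.Length - $offset)
            if ($read -eq 0) { break }
            $offset += $read
        }

        # Hash the combined head + tail bytes and return lowercase hex
        -join ($hash.ComputeHash($buffer) | ForEach-Object { '{0:x2}' -f $_ })
    }
    finally {
        $stream.Dispose()
        $hash.Dispose()
    }
}

# Example (hypothetical folder): files that share a partial hash are candidate copies
# Get-ChildItem 'D:\Test' -File |
#     Group-Object { Get-PartialFileHash -Path $_.FullName -Bytes 1MB } |
#     Where-Object Count -gt 1
```

Because only 2 * $Bytes bytes are read per file, the cost should depend on $Bytes rather than on the file size. A matching partial hash only marks files as candidates; a full hash would still be needed to confirm that two files are identical.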
CodePudding user response:
I believe this would be a more efficient way of reading the last bytes of your file, using System.IO.BinaryReader. You can combine this function with the function you have; it can read all bytes, the last n bytes (-Last) or the first n bytes (-First).
function Read-Bytes {
[cmdletbinding()]
param(
[parameter(
Mandatory,
ValueFromPipelineByPropertyName,
Position = 0
)][alias('FullName')]
[ValidateScript({
if(Test-Path $_ -PathType Leaf)
{
return $true
}
throw 'Invalid File Path'
})]
[System.IO.FileInfo]$Path,
[parameter(
HelpMessage = 'Specifies the number of Bytes from the beginning of a file.',
ParameterSetName = 'FirstBytes',
Position = 1
)]
[int]$First,
[parameter(
HelpMessage = 'Specifies the number of Bytes from the end of a file.',
ParameterSetName = 'LastBytes',
Position = 1
)]
[int]$Last
)
process
{
try
{
$reader = [System.IO.BinaryReader]::new(
[System.IO.File]::Open(
$Path.FullName,
[system.IO.FileMode]::Open,
[System.IO.FileAccess]::Read
)
)
$stream = $reader.BaseStream
$length = (
$stream.Length, $First
)[[int]($First -lt $stream.Length -and $First)]
# Start at 0 unless -Last was given and is smaller than the readable length
$stream.Position = (
0, ($length - $Last)
)[[int]($Last -and $length -gt $Last)]
$bytes = while($stream.Position -ne $length)
{
$stream.ReadByte()
}
[pscustomobject]@{
FilePath = $Path.FullName
Length = $length
Bytes = $bytes
}
}
catch
{
Write-Warning $_.Exception.Message
}
finally
{
$reader.Close()
$reader.Dispose()
}
}
}
Usage
- Get-ChildItem . -File | Read-Bytes -Last 100: Reads the last 100 bytes of all files in the current folder. If the -Last argument exceeds the file length, it reads the entire file.
- Get-ChildItem . -File | Read-Bytes -First 100: Reads the first 100 bytes of all files in the current folder. If the -First argument exceeds the file length, it reads the entire file.
- Read-Bytes -Path path/to/file.ext: Reads all bytes of file.ext.
Output
Returns an object with the properties FilePath, Length, Bytes.
FilePath Length Bytes
-------- ------ -----
/home/user/Documents/test/...... 14 {73, 32, 119, 111…}
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 116 {111, 109, 101, 95…}
/home/user/Documents/test/...... 17963 {50, 101, 101, 53…}
/home/user/Documents/test/...... 3617 {105, 32, 110, 111…}
/home/user/Documents/test/...... 638 {101, 109, 112, 116…}
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 36 {65, 99, 114, 101…}
/home/user/Documents/test/...... 735 {117, 112, 46, 79…}
/home/user/Documents/test/...... 1857 {108, 111, 115, 101…}
/home/user/Documents/test/...... 77 {79, 80, 69, 78…}
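The ReadByte loop in the function above emits one object per byte, which tends to get slow as the requested count grows toward the megabyte range mentioned in the question. As a hedged alternative, BinaryReader.ReadBytes can fetch the whole range in one call. This is a minimal sketch under stated assumptions: Read-TailBytes and its parameters are hypothetical names, and a full path is assumed.

```powershell
# Sketch only: Read-TailBytes and its parameter names are hypothetical.
function Read-TailBytes {
    param (
        [Parameter(Mandatory = $true)][string]$Path,   # full path to the file
        [Parameter(Mandatory = $true)][int]$Count      # number of bytes wanted from the end
    )
    $reader = [System.IO.BinaryReader]::new([System.IO.File]::OpenRead($Path))
    try {
        $stream = $reader.BaseStream
        # Never ask for more bytes than the file holds
        $take = [int][math]::Min([long]$Count, $stream.Length)
        $stream.Position = $stream.Length - $take
        # ReadBytes returns a byte[]; the unary comma keeps it from being unrolled
        , $reader.ReadBytes($take)
    }
    finally {
        # Disposing the reader also closes the underlying file stream
        $reader.Dispose()
    }
}
```

The returned byte[] can be passed straight to $hash.ComputeHash(), or concatenated with a head buffer first.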