I'm having some difficulty removing nodes from an XML file. I've found lots of examples of others doing this in PowerShell through various means; the code below seems identical to many of those examples, but I'm not getting the desired behavior.
My goal is to reduce the size of the output XML until it's below 4KB.
The code below doesn't error out, but the count of objects in $update_activity never changes, so the nodes don't appear to be removed.
This is a log in xml format, so I'm removing the oldest entries first.
sample xml:
<?xml version="1.0" encoding="utf-16"?>
<LogEntries version="1.0" appname="Dell Command | Update" appversion="4.3.0">
    <LogEntry>
        <serviceVersion>2.3.0.36</serviceVersion>
        <appname>DellCommandUpdate</appname>
        <level>Normal</level>
        <timestamp>2022-01-07T13:29:57.9364469-08:00</timestamp>
        <source>UpdateScheduler.UpdateScheduler.Start</source>
        <message>Starting the update scheduler.</message>
        <trace/>
        <data/>
    </LogEntry>
</LogEntries>
code:
[xml]$dcuxml = get-content "C:\ProgramData\dell\UpdateService\Log\Activity.log"
$xmllog = $dcuxml.LogEntries
$update_activity = $xmllog.LogEntry | NotableDCU
$i = 0
Do{
    foreach($entry in $update_activity){
        $entry.parentnode.RemoveChild($entry)
        $xmlsize = [System.Text.Encoding]::UTF8.GetByteCount(($update_activity.InnerXml | Out-String)) / 1KB
    }
}while($xmlsize -gt 3.99)
CodePudding user response:
Here is a possible solution. The basic idea is to create a memory stream and serialize XML nodes one-by-one, starting from the end, until the stream has reached the defined maximum size. While doing so, calculate the number of nodes to remove. Only then actually remove the nodes from the original document and save it.
$inputPath = "$PWD\log.xml"
$outPath = "$PWD\log_new.xml"
$maxXmlSize = 4KB
# Use Load() method instead of Get-Content to respect encoding attribute of XML doc
$dcuxml = [xml]::new(); $dcuxml.Load( $inputPath )
$xmllog = $dcuxml.DocumentElement.ChildNodes # | NotableDCU
$writerSettings = [Xml.XmlWriterSettings] @{
    Encoding = [Text.Encoding]::Unicode # UTF-16 as in input document
    # Replace with this line to encode in UTF-8 instead
    # Encoding = [Text.Encoding]::UTF8
    Indent = $true
    IndentChars = "`t"
    ConformanceLevel = [Xml.ConformanceLevel]::Auto
}
# Create a MemoryStream that is connected to a XmlWriter
$memStream = [IO.MemoryStream]::new()
$memWriter = [Xml.XmlWriter]::Create( $memStream, $writerSettings )
$memWriter.WriteStartDocument()
$memWriter.WriteStartElement( $dcuxml.DocumentElement.Name )
# For simplicity we don't write the attributes of DocumentElement
$numEntriesToRemove = $xmllog.Count
# Loop over log entries in reverse until stream has reached maxXmlSize
# to count number of entries to remove.
for( $i = $xmllog.Count - 1; $i -ge 0; --$i ) {
    $xmllog[$i].WriteTo( $memWriter )
    $memWriter.Flush() # make sure stream is actually written now
    # Check stream length which is always in bytes
    if( $memStream.Length -ge $maxXmlSize ) { break }
    # While we are below maxXmlSize, decrement the number of entries to remove
    --$numEntriesToRemove
}
# Remove log entries to bring the output size below maxXmlSize.
# Guard against $numEntriesToRemove being 0 (when all entries already fit),
# so the range 0..-1 doesn't select anything unintentionally.
if( $numEntriesToRemove -gt 0 ) {
    foreach( $entry in $xmllog[ 0..($numEntriesToRemove - 1) ] ) {
        $null = $entry.ParentNode.RemoveChild( $entry )
    }
}
# Use XmlWriter to ensure same write settings as we used when calculating the XML stream size.
$writer = [Xml.XmlWriter]::Create( $outPath, $writerSettings )
$dcuxml.Save( $writer )
$writer.Dispose() # Close file
As for what you have tried:
- The count of objects in the $update_activity array never changes because, when you call RemoveChild(), you only remove the nodes' references from the XML tree. The objects themselves are only deleted once no references to them remain, and $update_activity is an independent array of references that keeps the nodes alive. You would have to rebuild the $update_activity array on each loop iteration to reflect the changes to the XML tree, which would be very inefficient (a corrected sketch of your loop follows after this list).
- When you call $update_activity.InnerXml inside the loop, you basically serialize all log entries every time, which is again very inefficient. Also, InnerXml isn't indented, so its length doesn't reflect the size of the XML as stored on disk.
- You are using UTF8.GetByteCount(), while the document is actually encoded in UTF-16, which uses two bytes per character, so you end up calculating roughly half the number of bytes actually needed. The code above avoids this problem by using an XmlWriter with the correct encoding setting and reading the size of the underlying stream, which is always counted in bytes.
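If you want to keep the overall shape of your original loop, here is a minimal corrected sketch that applies the points above. Your NotableDCU filter is left out, the output path is just an example, and this still re-serializes the whole document on every pass, so the XmlWriter approach above remains the more efficient choice:
# Load() respects the encoding declared in the XML document (UTF-16 here).
$dcuxml = [xml]::new()
$dcuxml.Load( "C:\ProgramData\dell\UpdateService\Log\Activity.log" )
# Measure the whole document in UTF-16, matching the file's actual encoding.
# Note: OuterXml isn't indented, so this slightly underestimates the on-disk size.
$xmlsize = [Text.Encoding]::Unicode.GetByteCount( $dcuxml.OuterXml ) / 1KB
while( $xmlsize -gt 3.99 -and $dcuxml.DocumentElement.HasChildNodes ) {
    # Remove the oldest remaining entry - the first LogEntry child of the root.
    $null = $dcuxml.DocumentElement.RemoveChild( $dcuxml.DocumentElement.FirstChild )
    # Re-measure after each removal.
    $xmlsize = [Text.Encoding]::Unicode.GetByteCount( $dcuxml.OuterXml ) / 1KB
}
# Save() writes using the encoding from the XML declaration.
$dcuxml.Save( "$PWD\log_new.xml" )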
CodePudding user response:
This is an alternative solution that uses a streaming approach based on XmlReader and XmlWriter only. Compared to my first solution, it does not limit the size of the input file depending on the amount of available RAM.
While my first solution reads the whole input file into an XmlDocument in memory, this one only keeps as many log entries in memory as are needed for the output file.
It is also probably faster than the first solution, because it doesn't incur the overhead of creating a DOM (though a benchmark hasn't been done yet).
A disadvantage is that the code is lengthier than my first solution.
$inputPath = "$PWD\log.xml"
$outputPath = "$PWD\log_new.xml"
# Maximum size of the output file (which can be slightly larger as we only
# count the size of the log entries).
$maxByteCount = 4KB
$writerSettings = [Xml.XmlWriterSettings] @{
    Encoding = [Text.Encoding]::Unicode # UTF-16 as in input document
    # Replace with this line to encode in UTF-8 instead
    # Encoding = [Text.Encoding]::UTF8
    Indent = $true
    IndentChars = ' ' * 4 # should match indentation of input document
    ConformanceLevel = [Xml.ConformanceLevel]::Auto
}
$entrySeparator = "`n" + $writerSettings.IndentChars
$totalByteCount = 0
$queue = [Collections.Generic.Queue[object]]::new()
$reader = $writer = $null
try {
    # Open the input file.
    $reader = [Xml.XmlReader]::Create( $inputPath )
    # Create or overwrite the output file.
    $writer = [Xml.XmlWriter]::Create( $outputPath, $writerSettings )
    $writer.WriteStartDocument() # write the XML declaration
    # Copy the document root element and its attributes without recursing into child elements.
    $null = $reader.MoveToContent()
    $writer.WriteStartElement( $reader.Name )
    $writer.WriteAttributes( $reader, $false )
    # Read to first log entry.
    $hasEntry = $reader.ReadToDescendant('LogEntry')
    # Loop over the log entries of the input file.
    while( $hasEntry ) {
        # Read the XML of the current element and calculate how many bytes it takes when written to file.
        $xmlStr = $reader.ReadOuterXml()
        $byteCount = $writerSettings.Encoding.GetByteCount( $xmlStr + $entrySeparator )
        # Append XML and byte count of the current element to the end of the queue.
        $queue.Enqueue( [PSCustomObject]@{
            xmlStr = $xmlStr
            byteCount = $byteCount
        })
        $totalByteCount += $byteCount
        # Remove entries from beginning of queue to ensure maximum size is not exceeded.
        while( $totalByteCount -ge $maxByteCount ) {
            $totalByteCount -= $queue.Dequeue().byteCount
        }
        # Read to next log entry.
        $hasEntry = $reader.ReadToNextSibling('LogEntry')
    }
    # Write the last log entries, which are below maximum size, to the output file.
    foreach( $entry in $queue ) {
        $writer.WriteString( $entrySeparator )
        $writer.WriteRaw( $entry.xmlStr )
    }
    # Finish the document.
    $writer.WriteString("`n")
    $writer.WriteEndElement()
    $writer.WriteEndDocument()
}
finally {
    # Close the input and output files
    if( $writer ) { $writer.Dispose() }
    if( $reader ) { $reader.Dispose() }
}
The algorithm basically works like this:
- Create a queue of custom objects that store the XML and the size in bytes per log entry.
- For each log entry of the input file:
  - Read the XML of the log entry and calculate the size in bytes (as on disk, applying the output encoding) of the log entry. Add this data to the end of the queue.
  - If necessary, remove log entries from the beginning of the queue to ensure the desired maximum size in bytes is not exceeded.
- Write the log entries from the queue to the output file.
- For simplicity we only consider the size of the log entries, so the actual output file could be slightly larger, due to the XML declaration and the document root element.
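As a quick sanity check after running the script, you could compare the output file against the limit, using the $outputPath and $maxByteCount variables from the script above:
# The file may end up slightly larger than $maxByteCount, because the XML
# declaration and the root element are not counted by the algorithm above.
$outSize = (Get-Item $outputPath).Length
'{0:N0} bytes written (limit: {1:N0} bytes)' -f $outSize, $maxByteCount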