Home > Net >  Strange characters found in XML file and PowerShell output after exporting from Excel: ​
Strange characters found in XML file and PowerShell output after exporting from Excel: ​

Time:04-13

I have an XML file that I'm trying to read with PowerShell. However when I read it, the output of some of the XML objects have the following characters in them: ​

I simply downloaded an XML file I needed from a third-party, which opens in Excel. Then I grab the columns I need and paste them into a new Excel Workbook. Then I map the fields with an XML Schema and then export it as an XML file, which I then use for scripting.

In the Excel spreadsheet my data looks clean, but then when I export it and run the PS script, these strange characters appear in the output. The characters even appear in the actual XML file after exporting. What am I doing wrong?

I tried using -Encoding UTF8, but I'm relatively new to PowerShell and am not sure how to appropriately apply it to my script. Appreciate any help!

PowerShell

$xmlpath = 'Path\To\The\File.xml'

[xml]$xmldata = (Get-Content $xmlpath)

$xmldata.applications.application.name

Example of Output

​ABC_DEF_GHI​.com​​
​JKL_MNO_PQRS​.com​
TUV_WXY_Z.com
AB_CD_EF_GH​.com

CodePudding user response:

This is a prime example of why you shouldn't use the idiom [xml]$xmldata = (Get-Content $xmlpath) - as convenient as it is.[1] The problem is indeed one of character encoding: your file is UTF-8-encoded, but Windows PowerShell's Get-Content cmdlet interprets it as ANSI-encoded in the absence of a BOM - this answer explains the encoding part in detail.Thanks, choroba.

Instead, to ensure that the XML file's character encoding is interpreted correctly, use the following:

($xmlData = [xml]::new()).Load((Convert-Path $xmlPath))

This delegates interpretation of the character encoding to the System.Xml.XmlDocument.Load .NET API method, which not only assumes the proper default for XML (UTF-8), but also respects any explicit encoding specification as part of the XML declaration, if present (e.g.,
<?xml version="1.0" encoding="iso-8859-1"?>)

See also:


[1] If you happen to know the encoding of the input file ahead of time, you can get away with using Get-Content's -Encoding parameter in your original approach ([xml]$xmldata = (Get-Content -Encoding utf8 $xmlpath), but the .Load()-based approach is much more robust.

  • Related