I have a Zip archive containing a number of XML files, which I would like to read into a pandas DataFrame. The XML files are UTF-16 encoded, so they can be read like this:
import pandas as pd
# works
with open("data1.xml", encoding='utf-16') as f:
    data = pd.read_xml(f)
# works
data = pd.read_xml("data1.xml", encoding='utf-16')
However, I cannot read the same file directly from the Zip archive without extracting it manually first.
import zipfile
import pandas as pd
# does not work
with zipfile.open("data1.xml") as f:
    data = pd.read_xml(f, encoding='utf-16')
The problem seems to be the encoding, but I cannot manage to specify the UTF-16 correctly.
Many thanks for your help.
CodePudding user response:
ZipFile.open reads in binary mode. To read the file as UTF-16 text, wrap it in an io.TextIOWrapper.
The code below assumes a test.zip file with a UTF-16-encoded test.xml inside:
import zipfile
import pandas as pd
import io
z = zipfile.ZipFile('test.zip')
with z.open("test.xml") as f:
    t = io.TextIOWrapper(f, encoding='utf-16')
    data = pd.read_xml(t)
If the .zip file has a single .xml file in it, this works as well and is documented in pandas.read_xml (see the compression parameter):
data = pd.read_xml('test.zip', encoding='utf-16')
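Since the question mentions an archive with several XML files, the same idea can be extended to loop over the members. This is only a sketch under a few assumptions: the helper name read_all_xml_from_zip and the extra source column are made up for illustration, every .xml member is assumed to be UTF-16 encoded with the same layout, and parser="etree" is chosen so that lxml is not required. As a variation on the TextIOWrapper approach, it decodes each member's bytes explicitly and hands the text to pandas through io.StringIO:

```python
import io
import zipfile

import pandas as pd


def read_all_xml_from_zip(zip_path, encoding="utf-16"):
    """Read every .xml member of a zip archive into one DataFrame.

    Hypothetical helper: assumes all members share the same structure.
    """
    frames = []
    with zipfile.ZipFile(zip_path) as z:
        for name in z.namelist():
            # Skip anything that is not an XML file.
            if not name.lower().endswith(".xml"):
                continue
            # ZipFile.read returns raw bytes, so decode them ourselves.
            text = z.read(name).decode(encoding)
            frame = pd.read_xml(io.StringIO(text), parser="etree")
            # Record which file each row came from (illustrative extra column).
            frames.append(frame.assign(source=name))
    # Note: pd.concat raises ValueError if the archive has no .xml members.
    return pd.concat(frames, ignore_index=True)
```

With the archive from the question, this would be called as data = read_all_xml_from_zip("test.zip").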