Home > OS >  Read xls file containing xml data in python
Read xls file containing xml data in python

Time:07-09

An xls file has xml data inside it. Top portion of the file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Styles>
<Style ss:ID="sDT"><NumberFormat ss:Format="Short Date"/></Style>
</Styles>
<Worksheet ss:Name="XXX">
    <Table>
        <Row>
            <Cell><Data ss:Type="String">Request ID</Data></Cell>
            <Cell><Data ss:Type="String">Date</Data></Cell>
            <Cell><Data ss:Type="String">XXX ID</Data></Cell>
            <Cell><Data ss:Type="String">Customer Name</Data></Cell>
            <Cell><Data ss:Type="String">Amount</Data></Cell>
            <Cell><Data ss:Type="String">Requested Action</Data></Cell>
            <Cell><Data ss:Type="String">Status</Data></Cell>
            <Cell><Data ss:Type="String">Transaction ID</Data></Cell>
            <Cell><Data ss:Type="String">Merchant UTR</Data></Cell>
        </Row>

How can I read it into a Pandas DataFrame using pandas.read_xml. (Any other way of reading it into a DataFrame will also do.)

Note: Have already tried various solutions using read_excel with and without engine ="openpyxl". Different errors are displayed. (See comments below. The comments also contain a link to the same problem faced by others earlier.)

CodePudding user response:

Try to define another engine:

df = pd.read_excel('test.xls', engine='xlrd')

Note that you need to install xlrd library, e.g.:

pip install xlrd

CodePudding user response:

Your file is a valid xml file. I know no automatic converter for that but Excel, but it can easily be parsed as a mere xml file, for example with BeautifulSoul.

If the internal format is simple enough, you could just process the Worksheet, row and cell tags to convert it to a csv file:

from bs4 import BeautifulSoup
import csv
import io

soup = BeautifulSoup(open('file.xxx'))
    
with open('file.csv', newline='') as fdout:
    wr = csv.writer(fdout)
    sheet = soup.find('worksheet')
    for row in sheet.findAll('row'):
        wr.writerow(cell.text for cell in row.findAll('cell'))

Using your sample data, it gives as expected:

Request ID,Date,XXX ID,Customer Name,Amount,Requested Action,Status,Transaction ID,Merchant UTR
  • Related