how can i get pandas dataframe from excel in showing datatype?-CodePudding

Environment:
python 3.8.5
ipython 7.20.0
jupyterlab 3.0.7
ubuntu 20.04 LTS
pandas 1.2.2
openpyxl 3.0.10
googledrive (spreadsheet)

I'm doing...

import pandas as pd
ef = pd.ExcelFile('myfile.xlsx')  
df = ef.parse(ef.sheet_names[0], headers=None)
display(df)

Parsing exported xlsx from google spreadsheet to dataframe.
The spreadsheet's content is following next:

The Problem
It always parse A1(=1-1) to pd.Timestamp(2022.01.01 00:00:00).
But I want string value of "1-1".
I think it's origin value already inserted to datetime type.

I tried
Most of SO's solve is next.
So i tried that.

df1 = ef.parse(ef.sheet_names[0], headers=None)
df1.columns #=[0,1,2,3,4,5]
df = ef.parse(ef.sheet_names[0], headers=None, converters={c:str for c in df1.columns})
display(df.iloc[0][0])

But it shows string value but "2022-01-01 00:00:00"

Constraints
The spreadsheet's writer(=operator) says to me "I typed exactly 1-1 on the spreadsheet"
And there are many spreadsheet writer.
So they won't input '1-1 instead of 1-1, and strictly check it is really inserted string type or datetime type.

And google spreadsheet API (not drive api) returns it's value '1-1', so it works. But that API's quota is too small (60 calls per 1min, and consumes 1 call per 1 sub-spreadsheet). So I must need google drive api and export it.
That's why I can't using spreadsheet API notwithstanding it's actually works.

hope
I got xlsx exported file from google drive api like next way.

request = _google_drive_client.files().export_media(fileId=file_id,
                                               mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
file = io.BytesIO()
downloader = MediaIoBaseDownload(file, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
dn = file.getValue()
with open('test.xlsx', 'wb') as f:
    f.write(dn)

And my Apple Numbers shows me the information of '1-1' is still alive in that xlsx file.

So I hope I can find and make '1-1' in python again.

Question
Is there any way xlsx file load to python memory that's showing way? (I mean '1-1', not following origin value (datetime type 2022 01 01 00:00:00), or other parsing way)
Or simply I want '1-1' from parsing way.
Help me please!

CodePudding user response：

I hope I understood your question right. If you want to know how you can display 1-1 in it's actual state as straight text after exporting it to excel, I think it's best to use pandas with xlwings:

import pandas as pd
import xlwings as xw

df = pd.DataFrame([
    (1, "1-1", 8),
    (4, 0, "=1-1")],
    columns=["A", "B", "C"]
    )

with xw.App(visible=False) as app:
    wb = xw.Book()
    ws = wb.sheets[0]
    ws.api.Cells.NumberFormat = "@"  # Or instead: ws.range("A1:Z100").number_format = "@"
    ws["A1"].value = df
    wb.save("test.xlsx")
    wb.close()

The crucial point is to set the NumberFormat or rather number_format property before loading the values of df into the cells. This ensures that 1-1 appears as straight text.

As a side note: ws.api.Cells.NumberFormat = "@" changes the format of the whole sheet to text. If you prefer to change only a certain range, use ws.range("A1:Z100").number_format = "@".
Please also note that xlwings requires an excel installation.

CodePudding user response：

The problem is not pandas, of course. The problem is Excel. Unless told otherwise, it interprets 1-1 as a date. If you want to override that, start with a single quote: '1-1. The quote won't show, but Excel will treat the cell as a string.

CodePudding user response：

I answer for my own question because I found a answer.

I stop using for pandas dataframe excel parser.

I tried many mimeTypes for googledrive API, finally I dropped odf and xlsx export.

Finally I using 'zip' that makes every sheet to html, and one css file.

I downloaded it zip, and extract, and finally find the html contains the contents it shows exact same with google spreadsheet.

My solution is next:

def extract_zip(input_zip):
    input_zip=ZipFile(input_zip)
    return {name: input_zip.read(name) for name in input_zip.namelist()}

def read_json_from_zip_bytearray(file_bytearray):
    from bs4 import BeautifulSoup as soup
    from zipfile import ZipFile
    
    do = extract_zip(io.BytesIO(file_bytearray))
    
    dw = {k[:-5]:[[row.text for row in columns.find_all('td')] for columns in soup(v).find('table').find_all('tr') if columns.find_all('td')] for k, v in do.items() if k[-5:] == '.html'}
    
    return {k:[singlev for singlev in vv if singlev] for idx, vv in enumerate(v) for k, v in dw.items()}


read_json_from_zip_bytearray(downloaded_bytearray_from_googledrive_api_and_zip_mimetype)

#the value is
{'sheetname1': ['1-1',
   '1-2',
   '1-3',
   '1-4',
   '1-5',
   'other...'],
 'sheetname2': ['2-1', '2-2', '2-3'],
   ...}

I appreciate for every answers!

And I hope someone for who using google drive API & parse it keep it's own showing way.