Environment:
python 3.8.5
ipython 7.20.0
jupyterlab 3.0.7
ubuntu 20.04 LTS
pandas 1.2.2
openpyxl 3.0.10
googledrive (spreadsheet)
I'm doing...
import pandas as pd
ef = pd.ExcelFile('myfile.xlsx')
df = ef.parse(ef.sheet_names[0], headers=None)
display(df)
Parsing exported xlsx from google spreadsheet to dataframe.
The spreadsheet's content is following next:
The Problem
It always parse A1(=1-1) to pd.Timestamp(2022.01.01 00:00:00).
But I want string value of "1-1".
I think it's origin value already inserted to datetime type.
I tried
Most of SO's solve is next.
So i tried that.
df1 = ef.parse(ef.sheet_names[0], headers=None)
df1.columns #=[0,1,2,3,4,5]
df = ef.parse(ef.sheet_names[0], headers=None, converters={c:str for c in df1.columns})
display(df.iloc[0][0])
But it shows string value but "2022-01-01 00:00:00"
Constraints
The spreadsheet's writer(=operator) says to me "I typed exactly 1-1 on the spreadsheet"
And there are many spreadsheet writer.
So they won't input '1-1 instead of 1-1, and strictly check it is really inserted string type or datetime type.
And google spreadsheet API (not drive api) returns it's value '1-1', so it works. But that API's quota is too small (60 calls per 1min, and consumes 1 call per 1 sub-spreadsheet). So I must need google drive api and export it.
That's why I can't using spreadsheet API notwithstanding it's actually works.
hope
I got xlsx exported file from google drive api like next way.
request = _google_drive_client.files().export_media(fileId=file_id,
mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
file = io.BytesIO()
downloader = MediaIoBaseDownload(file, request)
done = False
while done is False:
status, done = downloader.next_chunk()
dn = file.getValue()
with open('test.xlsx', 'wb') as f:
f.write(dn)
And my Apple Numbers shows me the information of '1-1' is still alive in that xlsx file.
So I hope I can find and make '1-1' in python again.
Question
Is there any way xlsx file load to python memory that's showing way? (I mean '1-1', not following origin value (datetime type 2022 01 01 00:00:00), or other parsing way)
Or simply I want '1-1' from parsing way.
Help me please!
CodePudding user response:
I hope I understood your question right. If you want to know how you can display 1-1
in it's actual state as straight text after exporting it to excel, I think it's best to use pandas with xlwings:
import pandas as pd
import xlwings as xw
df = pd.DataFrame([
(1, "1-1", 8),
(4, 0, "=1-1")],
columns=["A", "B", "C"]
)
with xw.App(visible=False) as app:
wb = xw.Book()
ws = wb.sheets[0]
ws.api.Cells.NumberFormat = "@" # Or instead: ws.range("A1:Z100").number_format = "@"
ws["A1"].value = df
wb.save("test.xlsx")
wb.close()
The crucial point is to set the NumberFormat
or rather number_format
property before loading the values of df
into the cells. This ensures that 1-1
appears as straight text.
As a side note: ws.api.Cells.NumberFormat = "@"
changes the format of the whole sheet to text. If you prefer to change only a certain range, use ws.range("A1:Z100").number_format = "@"
.
Please also note that xlwings requires an excel installation.
CodePudding user response:
The problem is not pandas, of course. The problem is Excel. Unless told otherwise, it interprets 1-1
as a date. If you want to override that, start with a single quote: '1-1
. The quote won't show, but Excel will treat the cell as a string.
CodePudding user response:
I answer for my own question because I found a answer.
I stop using for pandas dataframe excel parser.
I tried many mimeTypes for googledrive API, finally I dropped odf and xlsx export.
Finally I using 'zip' that makes every sheet to html, and one css file.
I downloaded it zip, and extract, and finally find the html contains the contents it shows exact same with google spreadsheet.
My solution is next:
def extract_zip(input_zip):
input_zip=ZipFile(input_zip)
return {name: input_zip.read(name) for name in input_zip.namelist()}
def read_json_from_zip_bytearray(file_bytearray):
from bs4 import BeautifulSoup as soup
from zipfile import ZipFile
do = extract_zip(io.BytesIO(file_bytearray))
dw = {k[:-5]:[[row.text for row in columns.find_all('td')] for columns in soup(v).find('table').find_all('tr') if columns.find_all('td')] for k, v in do.items() if k[-5:] == '.html'}
return {k:[singlev for singlev in vv if singlev] for idx, vv in enumerate(v) for k, v in dw.items()}
read_json_from_zip_bytearray(downloaded_bytearray_from_googledrive_api_and_zip_mimetype)
#the value is
{'sheetname1': ['1-1',
'1-2',
'1-3',
'1-4',
'1-5',
'other...'],
'sheetname2': ['2-1', '2-2', '2-3'],
...}
I appreciate for every answers!
And I hope someone for who using google drive API & parse it keep it's own showing way.