My data.xlsx
is something like this
Number Grade
3217839 A
7893242 B
3782195 C
0 D
However, in excel, the Number
column text format is "Short Date", which cause the data to become
Number Grade
####### A
####### B
####### C
00-01-00 D
I used df = pd.read_csv("data.xlsx")
, and it hit error and show NaN
the error (only showing A2):
UserWarning: Cell A2 is marked as a date but the serial value 321783921 is outside the limits for dates.
The cell will be treated as an error. warn(msg)
Number Grade
0 NaN A
1 NaN B
2 NaN C
3 00:00:00 D
I have tried with df = pd.read_csv("data.xlsx", coverters={"Number": str})
. The result still the same. Any way to solve this without manually changing the format in the excel?
CodePudding user response:
Excel represents dates with a "serial" number, starting from 1 for 01.01.1900 up to 2958465 for 31.12.9999. So both Pandas and Excel complain about your values because they cannot be interpreted as date. Just change the cell format,it is simply wrong.
CodePudding user response:
If you don't have the authority to modify and then save the wb, you could use xlwings
to call Excel from within python. Open wb, change the number formatting of the cells to 'General', select your ranges, close wb and quit app.
# assuming data in wb.sheets[0].range('A1:B5')
import xlwings
import pandas as pd
my_file = 'data.xlsx'
excel_app = xlwings.App(visible=False)
excel_book = excel_app.books.open(my_file, read_only=True)
sht = excel_book.sheets[0]
sht.range('A1:A5').number_format = 'General'
my_cols = sht.range('A1:B1').value
my_range = sht.range('A2:B5').value
excel_book.close()
excel_app.quit()
df = pd.DataFrame(my_range, columns=my_cols)
# =============================================================================
#
# display(df)
#
# Number Grade
# 0 3217839.0 A
# 1 7893242.0 B
# 2 3782195.0 C
# 3 0.0 D
#
# =============================================================================