As the title states, after reviewing the docs:
I am reading an .xlsx file with a column 'HOUR' that has many values; whenever an instance has the value 99, I want to convert it to None.
I have tried the na_values param with different values:
na_values = ['99']
na_values = [r'99']
na_values = 99
...
and then read the Excel file like this:
accidents = pd.read_excel(filename, sheet_name='datos', na_values=[99])
but it doesn't seem to work. When I do:
np.sum(accidents['HOUR'] == 99)
I get a value greater than 0 (which means the instances with value 99 have not been transformed to None/NaN).
I have also read that I should include the option
keep_default_na=False
but to no avail.
The values of the 'HOUR' column are:
accidents['HOUR'].unique()
array([ 8, 15, 9, 14, 11, 0, 13, 20, 3, 19, 17, 7, 22, 21, 16, 6, 23,
18, 10, 12, 1, 99, 4, 5, 2, 24], dtype=int64)
I have updated pandas to version 1.5.1 and it still doesn't work. Any ideas why?
The .xls file can be found at: http://www.transtats.bts.gov/Fields.asp?Table_ID=1158
Thank you
CodePudding user response:
Just apply the replace() method on the DataFrame after reading the Excel file (note that replace() returns a new DataFrame, so assign the result back):
df = df.replace(99, np.nan)
If you want to replace values in only a specific column like HOUR:
df['HOUR'] = df['HOUR'].replace(99, np.nan)
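For reference, a minimal end-to-end sketch of that approach, assuming the same filename variable and 'datos' sheet from the question:

import numpy as np
import pandas as pd

# Read the sheet as-is; HOUR comes back as int64 with the 99s still present.
accidents = pd.read_excel(filename, sheet_name='datos')

# replace() returns a new Series, so assign it back; it does not modify the frame in place.
accidents['HOUR'] = accidents['HOUR'].replace(99, np.nan)

# The column becomes float64 and the former 99s are now NaN.
print(accidents['HOUR'].isna().sum())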
Update:
I think you want to know why the read_excel() method isn't working with the na_values you provided. If you check the documentation for the method:
na_values : scalar, str, list-like, or dict, default None. Additional strings to recognize as NA/NaN.
It only accepts strings in the na_values parameter, so you need to pass it as the string '99' for it to work in your case.
For some reason it's not working for integer na_values in Excel sheets. According to the docs, the read_excel method infers column types, but they can be explicitly specified. Maybe it doesn't set the HOUR column type correctly, so the na_values are not applied.
I found a workaround by specifying the column data type explicitly in the method, and it worked perfectly:
accidents = pd.read_excel(filename, sheet_name='datos', dtype={'HOUR': str}, na_values=['99'])
accidents = pd.read_excel(filename, sheet_name='datos', dtype={'HOUR': str}, na_values=[99])
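A quick sanity check of that workaround (just a sketch; note that HOUR is now read as strings, and the final to_numeric conversion is my own assumption, only needed if you want numeric hours back):

accidents = pd.read_excel(filename, sheet_name='datos', dtype={'HOUR': str}, na_values=['99'])

# No literal '99' strings should remain, and the former 99s should now be NaN.
print((accidents['HOUR'] == '99').sum())   # expected: 0
print(accidents['HOUR'].isna().sum())      # expected: greater than 0

# Optional: convert the remaining string hours back to numbers (NaN survives the conversion).
accidents['HOUR'] = pd.to_numeric(accidents['HOUR'])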
CodePudding user response:
Okay, an update.
I still have not figured out why the read_excel() function is not working as expected.
If anyone has this problem, I suggest converting the .xls/.xlsx file to .csv and changing the read call like this:
accidents = pd.read_excel(filename, sheet_name=0, na_values=missing_values_dict, keep_default_na=False)  # this still did not work
accidents = pd.read_csv(filename, sep=';', na_values=missing_values_dict, keep_default_na=False)  # this works on the .csv file
with missing_values_dict being something like:
missing_values_dict = {
'WEEKDAY': '9',
'HOUR': '99'
}
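If you don't want to convert the file by hand, one possible round trip through pandas (just a sketch; the intermediate 'accidents.csv' name and the 'datos' sheet are my assumptions, and the semicolon separator matches the read_csv call above):

# Dump the Excel sheet to a semicolon-separated CSV once...
pd.read_excel(filename, sheet_name='datos').to_csv('accidents.csv', sep=';', index=False)

# ...then read it back with per-column na_values, which works as expected for CSV input.
accidents = pd.read_csv('accidents.csv', sep=';', na_values=missing_values_dict, keep_default_na=False)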
Any questions, let me know!