As the title states, after reviewing the docs:
I am reading an .xlsx file with a column 'HOUR' that has many values; whenever an instance has the value 99, I want to convert it to None.
I have tried the na_values param with different values:
na_values = ['99']
na_values = [r'99']
na_values = 99
...
and then read the Excel file like this:
accidents = pd.read_excel(filename, sheet_name='datos', na_values=[99])
but it doesn't seem to work. When I do:
np.sum(accidents['HOUR'] == 99)
I get a value greater than 0 (which means the instances with value 99 have not been transformed to None/NaN).
I have also read that I should include the option
keep_default_na=False
but to no avail.
The values of the 'HOUR' column are:
accidents['HOUR'].unique()
array([ 8, 15, 9, 14, 11, 0, 13, 20, 3, 19, 17, 7, 22, 21, 16, 6, 23,
18, 10, 12, 1, 99, 4, 5, 2, 24], dtype=int64)
I have updated pandas to version 1.5.1 and it still doesn't work. Any ideas why?
The .xls file can be found at: http://www.transtats.bts.gov/Fields.asp?Table_ID=1158
Thank you
CodePudding user response:
Just apply the replace() method on the DataFrame after reading the Excel file (note that replace() returns a new DataFrame, so assign the result back):
df = df.replace(99, np.nan)
If you want to replace values in only a specific column like HOUR:
df['HOUR'] = df['HOUR'].replace(99, np.nan)
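For reference, a minimal end-to-end sketch of that approach, assuming the same filename variable and 'datos' sheet from the question:

import numpy as np
import pandas as pd

# Read the sheet as-is; HOUR comes back as int64 with the 99s still present.
accidents = pd.read_excel(filename, sheet_name='datos')

# replace() returns a new Series, so assign it back; it does not modify the frame in place.
accidents['HOUR'] = accidents['HOUR'].replace(99, np.nan)

# The column becomes float64 and the former 99s are now NaN.
print(accidents['HOUR'].isna().sum())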
Update:
I think you want to know why the read_excel() method isn't working with the na_values you provided. If you check the documentation for the method:
na_values : scalar, str, list-like, or dict, default None. Additional strings to recognize as NA/NaN.
It only accepts strings in the na_values parameter, so you need to pass it as the string '99' for it to work in your case.
For some reason it's not working for integer na_values in Excel sheets. According to the docs, the read_excel method infers column types, but they can be explicitly specified. Maybe it doesn't set the HOUR column type correctly, so the na_values are not applied.
I found a workaround by specifying the column data type explicitly in the method, and it worked perfectly:
accidents = pd.read_excel(filename, sheet_name='datos', dtype={'HOUR': str}, na_values=['99'])
accidents = pd.read_excel(filename, sheet_name='datos', dtype={'HOUR': str}, na_values=[99])
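A quick sanity check of that workaround (just a sketch; note that HOUR is now read as strings, and the final to_numeric conversion is my own assumption, only needed if you want numeric hours back):

accidents = pd.read_excel(filename, sheet_name='datos', dtype={'HOUR': str}, na_values=['99'])

# No literal '99' strings should remain, and the former 99s should now be NaN.
print((accidents['HOUR'] == '99').sum())   # expected: 0
print(accidents['HOUR'].isna().sum())      # expected: greater than 0

# Optional: convert the remaining string hours back to numbers (NaN survives the conversion).
accidents['HOUR'] = pd.to_numeric(accidents['HOUR'])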
CodePudding user response:
Okay, an update.
I still have not figured out why the read_excel() function is not working as expected.
If anyone has this problem, I suggest converting the .xls/.xlsx file to .csv and changing the read call like this:
accidents = pd.read_excel(filename, sheet_name=0, na_values=missing_values_dict, keep_default_na=False)  # this still did not work
accidents = pd.read_csv(filename, sep=';', na_values=missing_values_dict, keep_default_na=False)  # this works on the .csv file
with missing_values_dict being something like:
missing_values_dict = {
'WEEKDAY': '9',
'HOUR': '99'
}
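If you don't want to convert the file by hand, one possible round trip through pandas (just a sketch; the intermediate 'accidents.csv' name and the 'datos' sheet are my assumptions, and the semicolon separator matches the read_csv call above):

# Dump the Excel sheet to a semicolon-separated CSV once...
pd.read_excel(filename, sheet_name='datos').to_csv('accidents.csv', sep=';', index=False)

# ...then read it back with per-column na_values, which works as expected for CSV input.
accidents = pd.read_csv('accidents.csv', sep=';', na_values=missing_values_dict, keep_default_na=False)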
Any questions, let me know!