I am working with the credit card approval dataset from the UCI ML repository. The dataset contains missing values marked as '?'.
display(cc_apps.tail(17))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
673 ? 29.50 2.000 y p e h 2.000 f f 0 f g 00256 17 -
674 a 37.33 2.500 u g i h 0.210 f f 0 f g 00260 246 -
675 a 41.58 1.040 u g aa v 0.665 f f 0 f g 00240 237 -
676 a 30.58 10.665 u g q h 0.085 f t 12 t g 00129 3 -
677 b 19.42 7.250 u g m v 0.040 f t 1 f g 00100 1 -
678 a 17.92 10.210 u g ff ff 0.000 f f 0 f g 00000 50 -
679 a 20.08 1.250 u g c v 0.000 f f 0 f g 00000 0 -
680 b 19.50 0.290 u g k v 0.290 f f 0 f g 00280 364 -
681 b 27.83 1.000 y p d h 3.000 f f 0 f g 00176 537 -
682 b 17.08 3.290 u g i v 0.335 f f 0 t g 00140 2 -
683 b 36.42 0.750 y p d v 0.585 f f 0 f g 00240 3 -
684 b 40.58 3.290 u g m v 3.500 f f 0 t s 00400 0 -
685 b 21.08 10.085 y p e h 1.250 f f 0 f g 00260 0 -
686 a 22.67 0.750 u g c v 2.000 f t 2 t g 00200 394 -
687 a 25.25 13.500 y p ff ff 2.000 f t 1 t g 00200 1 -
688 b 17.92 0.205 u g aa v 0.040 f f 0 f g 00280 750 -
689 b 35.00 3.375 u g c h 8.290 f f 0 t g 00000 0 -
I converted these '?' values to NaN using the replace() method:
cc_apps_train = cc_apps_train.replace('?', 'NaN')
But when I print the DataFrame information using the info() method, it does not show any null values:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 14 columns):
0 462 non-null object
1 462 non-null object
2 462 non-null float64
3 462 non-null object
4 462 non-null object
5 462 non-null object
6 462 non-null object
7 462 non-null float64
8 462 non-null object
9 462 non-null object
10 462 non-null int64
12 462 non-null object
14 462 non-null int64
15 462 non-null object
dtypes: float64(2), int64(2), object(10)
memory usage: 54.1 KB
Can anyone please explain this?
Answer:
Your line of code:
cc_apps_train = cc_apps_train.replace('?', 'NaN')
converts the string '?' to the string 'NaN'. A string is a non-null object, so info() still counts every value as non-null. Replace with numpy.nan instead of 'NaN' (the numpy.NaN alias was removed in NumPy 2.0) and it should work fine.
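Here is a minimal sketch of the difference. The three sample rows are made up for illustration and are not taken from the dataset, and the variable names are hypothetical. It also shows one possible alternative, assuming you load the file with read_csv: passing na_values='?' marks the placeholders as missing at read time.

```python
import io

import numpy as np
import pandas as pd

# Tiny made-up sample with '?' placeholders (not real rows from the dataset).
raw = io.StringIO("a,30.83,0.0\n?,58.67,4.46\nb,?,0.5\n")
df = pd.read_csv(raw, header=None)

# Replacing with the string 'NaN' leaves every cell non-null.
as_string = df.replace('?', 'NaN')
string_nulls = int(as_string.isna().sum().sum())

# Replacing with np.nan produces real missing values.
as_missing = df.replace('?', np.nan)
missing_per_column = as_missing.isna().sum().tolist()

# Alternative: treat '?' as missing while reading the file.
raw.seek(0)
at_read = pd.read_csv(raw, header=None, na_values='?')
read_nulls = at_read.isna().sum().tolist()

print(string_nulls, missing_per_column, read_nulls)
```

With the string replacement the null count stays at zero, while both np.nan and na_values='?' report one missing value in each of the first two columns. Reading with na_values also lets pandas parse a numeric column as float64 instead of object, since the '?' strings never enter the frame.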