I have a dataframe 'df'. I want to create a new column to add at the end of my crosstab. I used the following commands, but the percentages are displaying as NaN values for some reason. How do I fix this error?
My code:
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
trainData['RFM'] = trainData['Mcode'].astype(str) trainData['Rcode'].astype(str) trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index=[trainData['RFM']], columns=trainData['Florence'], margins=True)
rfm_crosstab['Percentage'] = trainData.Florence[1] / (trainData.Florence[1] trainData.Florence[0])
rfm_crosstab
My result: Expected result should display correct percentage instead of NaN values
Florence 0 1 All Percentage
RFM
111 3 0 3 NaN
121 3 2 5 NaN
131 9 1 10 NaN
141 14 0 14 NaN
211 3 0 3 NaN
212 1 2 3 NaN
221 6 3 9 NaN
222 6 1 7 NaN
231 20 2 22 NaN
232 22 1 23 NaN
241 24 2 26 NaN
242 18 1 19 NaN
311 9 0 9 NaN
312 6 0 6 NaN
313 2 1 3 NaN
321 17 3 20 NaN
322 20 4 24 NaN
323 2 1 3 NaN
331 35 1 36 NaN
332 30 2 32 NaN
333 19 0 19 NaN
341 61 10 71 NaN
342 58 2 60 NaN
343 17 2 19 NaN
411 12 3 15 NaN
412 19 1 20 NaN
413 15 1 16 NaN
421 29 0 29 NaN
422 26 5 31 NaN
423 32 8 40 NaN
431 75 6 81 NaN
432 68 3 71 NaN
433 71 5 76 NaN
441 96 12 108 NaN
442 109 6 115 NaN
443 103 10 113 NaN
511 19 4 23 NaN
512 23 2 25 NaN
513 44 8 52 NaN
521 24 5 29 NaN
522 44 2 46 NaN
523 74 16 90 NaN
531 110 4 114 NaN
532 102 4 106 NaN
533 177 28 205 NaN
541 107 6 113 NaN
542 122 11 133 NaN
543 282 21 303 NaN
All 2188 212 2400 NaN
Message:
/var/folders/wv/42dn23fd1cb0czpvqdnb6zw00000gn/T/ipykernel_655/3261767759.py:4: RuntimeWarning: invalid value encountered in long_scalars
rfm_crosstab['Percentage'] = trainData.Florence[1] / (trainData.Florence[1] trainData.Florence[0])
Data:
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
CodePudding user response:
This is mostly because (trainData.Florence[1] trainData.Florence[0])
is 0. Actually my guess is you want to calculate the percentage on rfm_crosstab
and not trainData
. What probably you actually want is.
rfm_crosstab = pd.crosstab(index=[trainData['RFM']], columns=trainData['Florence'], margins=True)
rfm_crosstab['percentage'] = 100*rfm_crosstab[1]/rfm_crosstab['All']