Home > Net >  Pandas Percentage of Crosstab
Pandas Percentage of Crosstab

Time:09-13

I have a dataframe 'df'. I want to create a new column to add at the end of my crosstab. I used the following commands, but the percentages are displaying as NaN values for some reason. How do I fix this error?

My code:

trainData, validData = train_test_split(df, test_size=0.4, random_state=1)

trainData['RFM'] = trainData['Mcode'].astype(str)   trainData['Rcode'].astype(str)   trainData['Fcode'].astype(str)

rfm_crosstab = pd.crosstab(index=[trainData['RFM']], columns=trainData['Florence'], margins=True)
rfm_crosstab['Percentage'] = trainData.Florence[1] / (trainData.Florence[1]   trainData.Florence[0])

rfm_crosstab

My result: Expected result should display correct percentage instead of NaN values

Florence    0   1   All Percentage
RFM             
111 3   0   3   NaN
121 3   2   5   NaN
131 9   1   10  NaN
141 14  0   14  NaN
211 3   0   3   NaN
212 1   2   3   NaN
221 6   3   9   NaN
222 6   1   7   NaN
231 20  2   22  NaN
232 22  1   23  NaN
241 24  2   26  NaN
242 18  1   19  NaN
311 9   0   9   NaN
312 6   0   6   NaN
313 2   1   3   NaN
321 17  3   20  NaN
322 20  4   24  NaN
323 2   1   3   NaN
331 35  1   36  NaN
332 30  2   32  NaN
333 19  0   19  NaN
341 61  10  71  NaN
342 58  2   60  NaN
343 17  2   19  NaN
411 12  3   15  NaN
412 19  1   20  NaN
413 15  1   16  NaN
421 29  0   29  NaN
422 26  5   31  NaN
423 32  8   40  NaN
431 75  6   81  NaN
432 68  3   71  NaN
433 71  5   76  NaN
441 96  12  108 NaN
442 109 6   115 NaN
443 103 10  113 NaN
511 19  4   23  NaN
512 23  2   25  NaN
513 44  8   52  NaN
521 24  5   29  NaN
522 44  2   46  NaN
523 74  16  90  NaN
531 110 4   114 NaN
532 102 4   106 NaN
533 177 28  205 NaN
541 107 6   113 NaN
542 122 11  133 NaN
543 282 21  303 NaN
All 2188    212 2400    NaN

Message:

/var/folders/wv/42dn23fd1cb0czpvqdnb6zw00000gn/T/ipykernel_655/3261767759.py:4: RuntimeWarning: invalid value encountered in long_scalars
rfm_crosstab['Percentage'] = trainData.Florence[1] / (trainData.Florence[1] trainData.Florence[0])

Data:

Seq#    ID# Gender  M   R   F   FirstPurch  ChildBks    YouthBks    CookBks ... ItalCook    ItalAtlas   ItalArt Florence    Related Purchase    Mcode   Rcode   Fcode   Yes_Florence    No_Florence
0   1   25  1   297 14  2   22  0   1   1   ... 0   0   0   0   0   5   4   2   0   1
1   2   29  0   128 8   2   10  0   0   0   ... 0   0   0   0   0   4   3   2   0   1
2   3   46  1   138 22  7   56  2   1   2   ... 1   0   0   0   2   4   4   3   0   1
3   4   47  1   228 2   1   2   0   0   0   ... 0   0   0   0   0   5   1   1   0   1
4   5   51  1   257 10  1   10  0   0   0   ... 0   0   0   0   0   5   3   1   0   1

CodePudding user response:

This is mostly because (trainData.Florence[1] trainData.Florence[0]) is 0. Actually my guess is you want to calculate the percentage on rfm_crosstab and not trainData. What probably you actually want is.

rfm_crosstab = pd.crosstab(index=[trainData['RFM']], columns=trainData['Florence'], margins=True)
rfm_crosstab['percentage'] = 100*rfm_crosstab[1]/rfm_crosstab['All']
  • Related