Create a stacked bar plot and annotate with count and percent with focus of displaying small values-CodePudding

I have the following dataframe

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib 
print('matplotlib: {}'.format(matplotlib.__version__))
# 3.5.3

df=pd.DataFrame({'Type': [ 'Sentence', 'Array', 'String', '-','-', 'Sentence', 'Array', 'String', '-','-', 'Sentence'],
                 'Length': [42,21,11,6,6,42,21,11,6,6,42],
                 'label': [1,1,0,0,0,1,1,0,0,0,1],
                 })
print(df)
#       Type     Length  label
#0   Sentence      42      1
#1      Array      21      1
#2     String      11      0
#3          -       6      0
#4          -       6      0
#5   Sentence      42      1
#6      Array      21      1
#7     String      11      0
#8          -       6      0
#9          -       6      0
#10  Sentence      42      1

I want to plot stacked bar chart for the arbitrary column within dataframe (either numerical e.g. Length column or

CodePudding user response：

The values in Expected output do not match df in the OP, so the sample DataFrame has been updated.

Plot with

Comment Updates

How to always have a spot for 'Array' if it's not in the data:
- Add 'Array' to dfp if it's not in dfp.index.
- df.Type = pd.Categorical(df.Type, ['-', 'Array', 'Sentence', 'String'], ordered=True) does not ensure the missing categories are plotted.
How to have all the annotations, even if they're small:
- Don't stack the bars, and set logy=True.
This uses the full-data, which was provided in a link.

# pivot the dataframe and get len
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len) 

# append Array if it's not included
if 'Array' not in dfp.index:
    dfp = pd.concat([dfp, pd.DataFrame({0: [np.nan], 1: [np.nan]}, index=['Array'])])
    
# order the index
dfp = dfp.loc[['-', 'Array', 'Sentence', 'String'], :]

# calculate the percent for each row
per = dfp.div(dfp.sum(axis=1), axis=0).mul(100).round(2)

# plot the pivoted dataframe
ax = dfp.plot(kind='bar', stacked=False, figsize=(10, 8), rot=0, logy=True, width=0.75)

# iterate through the containers
for c in ax.containers:
    
    # get the current segement label (a string); corresponds to column / legend
    label = c.get_label()
    
    # create custom labels with the bar height and the percent from the per column
    # the column labels in per and dfp are int, so convert label to int
    labels = [f'{v.get_height()}\n({row}%)' if v.get_height() > 0 else '' for v, row in zip(c, per[int(label)])]
    
    # add the annotation
    ax.bar_label(c, labels=labels, label_type='edge', fontsize=10, fontweight='bold')
    
# move the legend
ax.legend(title='Class', bbox_to_anchor=(1, 1.01), loc='upper left')

# pad the spacing between the number and the edge of the figure
_ = ax.margins(y=0.1)

DataFrame Views

Based on the sample data in the OP

`df`

        Type  Length  label
0   Sentence      42      1
1      Array      21      1
2     String      11      0
3          -       6      0
4          -       6      0
5   Sentence      42      1
6      Array      21      1
7     String      11      0
8          -       6      0
9          -       6      1
10  Sentence      42      0

`dfp`

label       0    1
Type              
-         3.0  1.0
Array     NaN  2.0
Sentence  1.0  2.0
String    2.0  NaN

`total`

Type
-           4.0
Array       2.0
Sentence    3.0
String      2.0
dtype: float64

`per`

label          0       1
Type                    
-          75.00   25.00
Array        NaN  100.00
Sentence   33.33   66.67
String    100.00     NaN