I have the following dataframe
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# 3.5.3
df=pd.DataFrame({'Type': [ 'Sentence', 'Array', 'String', '-','-', 'Sentence', 'Array', 'String', '-','-', 'Sentence'],
'Length': [42,21,11,6,6,42,21,11,6,6,42],
'label': [1,1,0,0,0,1,1,0,0,0,1],
})
print(df)
# Type Length label
#0 Sentence 42 1
#1 Array 21 1
#2 String 11 0
#3 - 6 0
#4 - 6 0
#5 Sentence 42 1
#6 Array 21 1
#7 String 11 0
#8 - 6 0
#9 - 6 0
#10 Sentence 42 1
I want to plot a stacked bar chart for an arbitrary column within the dataframe (either numerical, e.g. the Length
column, or
CodePudding user response:
The values in the expected output do not match df in the OP, so the sample DataFrame has been updated.
Comment Updates
- How to always have a spot for 'Array' if it's not in the data:
  - Add 'Array' to dfp if it's not in dfp.index. df.Type = pd.Categorical(df.Type, ['-', 'Array', 'Sentence', 'String'], ordered=True) does not ensure the missing categories are plotted.
- How to have all the annotations, even if they're small:
  - Don't stack the bars, and set logy=True.
- This uses the full data, which was provided in a link.
# numpy is needed for np.nan below (pandas and matplotlib are imported in the OP)
import numpy as np

# pivot the dataframe and get len
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)

# append Array if it's not included
if 'Array' not in dfp.index:
    dfp = pd.concat([dfp, pd.DataFrame({0: [np.nan], 1: [np.nan]}, index=['Array'])])

# order the index
dfp = dfp.loc[['-', 'Array', 'Sentence', 'String'], :]

# calculate the percent for each row
per = dfp.div(dfp.sum(axis=1), axis=0).mul(100).round(2)

# plot the pivoted dataframe
ax = dfp.plot(kind='bar', stacked=False, figsize=(10, 8), rot=0, logy=True, width=0.75)

# iterate through the containers
for c in ax.containers:

    # get the current segment label (a string); corresponds to column / legend
    label = c.get_label()

    # create custom labels with the bar height and the percent from the per column
    # the column labels in per and dfp are int, so convert label to int
    labels = [f'{v.get_height()}\n({row}%)' if v.get_height() > 0 else '' for v, row in zip(c, per[int(label)])]

    # add the annotation
    ax.bar_label(c, labels=labels, label_type='edge', fontsize=10, fontweight='bold')

# move the legend
ax.legend(title='Class', bbox_to_anchor=(1, 1.01), loc='upper left')

# pad the spacing between the number and the edge of the figure
_ = ax.margins(y=0.1)
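As a possible alternative to the concat step for guaranteeing a spot for 'Array', reindexing the pivoted frame against the full category list should also work. This is only a sketch: the small df below is made-up illustration data (not the OP's full data), and `order` is a name introduced here for the example.

```python
import pandas as pd

# hypothetical sample with no 'Array' rows, for illustration only
df = pd.DataFrame({
    'Type': ['Sentence', 'String', '-', '-', 'Sentence'],
    'Length': [42, 11, 6, 6, 42],
    'label': [1, 0, 0, 1, 1],
})

# pivot the dataframe and count the rows per Type / label
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)

# reindex against every expected category; missing ones become all-NaN rows,
# which also puts the index in the desired order
order = ['-', 'Array', 'Sentence', 'String']
dfp = dfp.reindex(order)

print(dfp)
```

The 'Array' row is all NaN, so it gets a tick on the x-axis but no bars, which is the same result the concat approach gives.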
DataFrame Views
- Based on the sample data in the OP
df
        Type  Length  label
0   Sentence      42      1
1      Array      21      1
2     String      11      0
3          -       6      0
4          -       6      0
5   Sentence      42      1
6      Array      21      1
7     String      11      0
8          -       6      0
9          -       6      1
10  Sentence      42      0
dfp
label       0    1
Type
-         3.0  1.0
Array     NaN  2.0
Sentence  1.0  2.0
String    2.0  NaN
total
Type
-           4.0
Array       2.0
Sentence    3.0
String      2.0
dtype: float64
per
label          0       1
Type
-          75.00   25.00
Array        NaN  100.00
Sentence   33.33   66.67
String    100.00     NaN
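For reference, the views above can be reproduced without any plotting. The sketch below rebuilds the updated sample df and assumes that `total` is simply the row sum of dfp (it is computed inline in the plotting code, so the separate name is only for display).

```python
import pandas as pd

# the updated sample data shown in the df view
df = pd.DataFrame({
    'Type': ['Sentence', 'Array', 'String', '-', '-', 'Sentence',
             'Array', 'String', '-', '-', 'Sentence'],
    'Length': [42, 21, 11, 6, 6, 42, 21, 11, 6, 6, 42],
    'label': [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})

# count the rows for each Type / label combination
dfp = df.pivot_table(index='Type', columns='label', values='Length', aggfunc=len)

# row totals; NaN cells are skipped by sum
total = dfp.sum(axis=1)

# divide each row by its total to get the percent per label
per = dfp.div(total, axis=0).mul(100).round(2)

print(dfp, total, per, sep='\n\n')
```

Dividing with axis=0 aligns `total` on the row index, so each percent is relative to its own Type rather than to the whole frame.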