scipy.stats.cumfreq() isn't the cumulative frequency I'm looking for

While working through a statistics book, I'm also practicing with Python.

My book asks me to calculate the cumulative workforce (effectif cumulé) and the cumulative frequency (fréquences cumulées) of a simple table of jobs by sector.

Sector	Number of jobs
Agriculture	21143585
Construction	35197834
Industry	69941779
Manufacturing	64298386
Services	368931820

I wrote this Python program:

import numpy as np
import scipy.stats

if __name__ == '__main__':
    emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]
    print("effectif cumulé : ", np.cumsum(emploi_par_activité))
    print("fréquences cumulées", scipy.stats.cumfreq(emploi_par_activité))

which prints:

effectif cumulé :  [ 21143585  56341419 126283198 190581584 559513404]
fréquences cumulées CumfreqResult(cumcount=array([2., 4., 4., 4., 4., 4., 4., 4., 4., 5.]), lowerlimit=1822016.388888888, binsize=38643137.222222224, extrapoints=0)

My book agrees with the cumulative workforce, but not with the cumulative frequency. So it seems I've been misled by the name scipy.stats.cumfreq: it sounds like the function that does what I want, but it doesn't.

What is the proper method I should use instead?

CodePudding user response:

cumfreq is for "raw" data; that is, data that has not already been counted or aggregated by some category. If you had a big database with 559513404 records, where each record corresponds to a distinct person and one field of the record is a number that categorizes their job (0=Agriculture, 1=Construction, etc.), then you could apply cumfreq to the data in that field.
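As a rough illustration of that "raw data" case, the sketch below builds a small array of job codes and passes it to cumfreq. The counts are made-up toy values (not the real table), and the numbins and defaultreallimits settings are assumptions chosen so that each integer code falls into its own bin.

```python
import numpy as np
import scipy.stats

# Hypothetical toy counts per job code (0=Agriculture, 1=Construction, ...).
toy_counts = [2, 3, 1, 4, 10]

# "Raw" data: one record per person, holding that person's job code.
raw_codes = np.repeat(np.arange(len(toy_counts)), toy_counts)

# One bin of width 1 centred on each integer code (an assumed binning choice).
res = scipy.stats.cumfreq(
    raw_codes,
    numbins=len(toy_counts),
    defaultreallimits=(-0.5, len(toy_counts) - 0.5),
)

print(res.cumcount)                   # cumulative counts per code
print(res.cumcount / raw_codes.size)  # cumulative relative frequencies
```

On this toy input, res.cumcount holds the same values as np.cumsum(toy_counts), so when the counts are already known, the raw array and cumfreq are not needed.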

Your data is already aggregated by job type. To get the result that you expected, compute the cumulative sum, and then divide each element in the cumulative sum by the total (which happens to be the last element of the cumulative sum):

In [214]: import numpy as np

In [215]: emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]

In [216]: csum = np.cumsum(emploi_par_activité)

In [217]: csum
Out[217]: array([ 21143585,  56341419, 126283198, 190581584, 559513404])

In [218]: csum/csum[-1]  # fréquences cumulées
Out[218]: array([0.03778924, 0.10069717, 0.22570183, 0.34062023, 1.        ])
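
For use outside an interactive session, the same computation can be sketched as a small script. The helper name cumulative_table is made up for illustration; it is not part of NumPy or SciPy.

```python
import numpy as np


def cumulative_table(counts):
    """Return (cumulative counts, cumulative relative frequencies)."""
    csum = np.cumsum(np.asarray(counts))
    # The last cumulative count is the grand total.
    return csum, csum / csum[-1]


emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]

effectifs, fréquences = cumulative_table(emploi_par_activité)
for n, f in zip(effectifs, fréquences):
    # Cumulative count, then cumulative frequency rounded to 4 decimals.
    print(f"{n:>11}{f:>10.4f}")
```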