I'm trying to fit various distributions to my data and test (chi-squared?) which one fits best. I started with scipy's gumbel_r
distribution, as it is the one often used in the literature.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as ss
data = pd.read_csv("data.csv")
data
sns.histplot(data["score"], kde=True, stat='probability')
plt.show()
x = np.linspace(0,1,101)
hist, bins = np.histogram(data["score"], bins=x, density=True)
loc, scale = ss.gumbel_r.fit(hist)
dist = ss.gumbel_r(loc=loc,scale=scale)
plt.plot(x, dist.pdf(x))
plt.show()
Inspecting the plots yields strange results. For example, my data has a peak of about 0.025 at ~0.09. However, the plotted Gumbel PDF looks completely off.
My questions are now:
- Why do the plots not look similar? I suspect stat='probability' could be the culprit here.
- What do I need to do so that the second plot looks somewhat similar to the first one?
- Ideally I would compute another hist of the fitted distribution for the same bins and pass both into scipy.stats.chisquare to quantify how good the fit is and see which distribution fits best. Is that correct?
CodePudding user response:
Don't pass hist to gumbel_r.fit(). It expects the original data, not the histogram bin heights. Change the line that calls fit() to:
loc, scale = ss.gumbel_r.fit(data['score'].to_numpy())
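To see why this matters, here is a minimal sketch that fits both the raw samples and the histogram heights. Since data.csv is not available here, it assumes synthetic Gumbel-distributed scores; the true_loc and true_scale values are made up for illustration.

```python
import numpy as np
import scipy.stats as ss

# Synthetic stand-in for data["score"]; the "true" parameters are invented.
rng = np.random.default_rng(0)
true_loc, true_scale = 0.09, 0.03
scores = ss.gumbel_r.rvs(loc=true_loc, scale=true_scale, size=5000, random_state=rng)

# Fitting the raw samples estimates the parameters of the score distribution.
loc_raw, scale_raw = ss.gumbel_r.fit(scores)

# Fitting the bin heights treats the 100 density values as if they were samples,
# so the result describes the distribution of bar heights, not of the scores.
edges = np.linspace(0, 1, 101)
heights, _ = np.histogram(scores, bins=edges, density=True)
loc_hist, scale_hist = ss.gumbel_r.fit(heights)

print(f"raw samples: loc={loc_raw:.3f}, scale={scale_raw:.3f}")
print(f"bin heights: loc={loc_hist:.3f}, scale={scale_hist:.3f}")
```

Only the first pair of parameters should land near the values used to generate the data.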
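Also, to get the Seaborn plot on the same scale as the plot of the PDF, change stat='probability' to stat='density' in the histplot() call. With stat='probability' the bar heights sum to 1, whereas a PDF integrates to 1, so the two differ by a factor of the bin width.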
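For the chi-squared comparison asked about in the question, one possible approach is sketched below. It is an illustration under assumptions, not a definitive recipe: the data is synthetic, chi2_gof is a hypothetical helper name, and the bins are chosen as equal-probability bins of each fitted distribution rather than the 100 fixed bins from the question, because many of those would have expected counts near zero.

```python
import numpy as np
import scipy.stats as ss

# Synthetic stand-in for data["score"] (assumption; parameters are made up).
rng = np.random.default_rng(1)
scores = ss.gumbel_r.rvs(loc=0.09, scale=0.03, size=5000, random_state=rng)

def chi2_gof(samples, family, k=20):
    """Hypothetical helper: chi-squared goodness of fit with k equal-probability bins."""
    params = family.fit(samples)              # MLE on the raw samples
    frozen = family(*params)
    # k - 1 interior edges at the fitted quantiles; the outer bins are open-ended.
    inner = frozen.ppf(np.arange(1, k) / k)
    observed = np.bincount(np.searchsorted(inner, samples), minlength=k)
    expected = np.full(k, len(samples) / k)   # each bin has probability 1 / k
    # ddof accounts for the parameters estimated from the same data.
    return ss.chisquare(observed, expected, ddof=len(params))

candidates = {"gumbel_r": ss.gumbel_r, "norm": ss.norm}
results = {name: chi2_gof(scores, fam) for name, fam in candidates.items()}

for name, res in results.items():
    print(f"{name}: statistic={res.statistic:.1f}, p-value={res.pvalue:.3g}")
```

A smaller statistic (larger p-value) indicates a better match between observed and expected counts. Note that scipy.stats.chisquare expects counts, not densities or probabilities, and the observed and expected counts must have the same total.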