I am a beginner in Python and I am making separate histograms of travel distance per departure hour. Data I'm using, about 2500 rows of this. Distance is float64, the Departuretime is str. However, for making further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color = 'red',
edgecolor = 'black',figsize=(15,15),sharex=True,density=True)
This creates in my case a figure with 21 small histograms. Histogram output I'm receiving. Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With single histograms, I'd paste counts, bins, bars =
in front of the entire line and the variable counts
would contain the data I was looking for, however, in this case it does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
CodePudding user response:
First of all, note that the bins used in the different histograms that you are generating don't have the same edges (you can see this since you are using sharex=True
and the resulting bars don't have the same width), in all cases you are getting 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins
parameter to standarize this.
Alternatively, I suggest you calculate a new column that describes to which bin each row belongs, this way we are also unifying the bins calulation.
You can do this with the cut
function, which also gives you the same freedom to choose the number of bins or the specific bin edges the same way as with hist
.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table
to obtain a table with the counts for each combination of DistanceBin
and Departuretime
as rows and columns respectively as you asked.
df.pivot_table(index='DistanceBin', columns='Departuretime', aggfunc='count')