Unable to create a boxplot possibly size problem

Time:11-30

I have the following problem: I would like to draw three boxplots from three different datasets. My code:

fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.suptitle(f'Number of requests (similar IPs combined). 2021{koncovka}')

ax1.boxplot(data_count_G["ip_count"])
ax1.set_xlabel(f"Google bot, n = {len(data_count_G)}")
ax1.set_ylabel("requests from ip")

ax2.boxplot(data_count_S["ip_count"])
ax2.set_xlabel(f"Seznam bot, n = {len(data_count_S)}")
ax2.set_ylabel("requests from ip")

ax3.boxplot(data_count_nGS["ip_count"])
ax3.set_xlabel(f"Without Google bot, n = {len(data_count_nGS)}")
ax3.set_ylabel("requests from ip")

plt.tight_layout()
plt.savefig('box_request_count_GSnG.png', bbox_inches='tight')
plt.close()

However, the result looks like this:

[image of the resulting plot]

When I run data_count_nGS.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92774 entries, 0 to 20899956
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   zacatek_ip  92773 non-null  object 
 1   ip_count    92773 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.1 MB

When I do data_count_nGS.describe():

            ip_count
count   92773.000000
mean      209.073351
std      1430.188719
min         1.000000
25%        70.000000
50%       107.000000
75%       194.000000
max    253248.000000

Is it possible that the problem is in the size of the last dataframe (92774)? How can I fix it please?

Answer:

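You probably need to remove NaN values. The info() output reports 92774 entries but only 92773 non-null values in ip_count, so the column contains one NaN, and matplotlib's boxplot does not drop NaNs itself. Try ax3.boxplot(data_count_nGS["ip_count"].dropna()). You could also try seaborn's boxplot, which removes NaNs automatically.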
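To see why a single NaN matters, here is a minimal sketch. It assumes (as far as I know) that matplotlib derives the box from NumPy percentiles; the small `ip_count` series is a made-up stand-in for data_count_nGS["ip_count"], not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_count_nGS["ip_count"]: a few counts plus one NaN.
ip_count = pd.Series([1.0, 70.0, 107.0, 194.0, np.nan])

# np.percentile propagates NaN: one missing value makes every quartile NaN,
# so there is no box left to draw.
quartiles_with_nan = np.percentile(ip_count.to_numpy(), [25, 50, 75])

# After dropna() the quartiles are ordinary finite numbers again.
clean = ip_count.dropna()
quartiles_clean = np.percentile(clean.to_numpy(), [25, 50, 75])

print("NaN count:", ip_count.isna().sum())
print("quartiles with NaN:", quartiles_with_nan)
print("quartiles after dropna():", quartiles_clean)
```

Checking `isna().sum()` before plotting is a cheap way to confirm whether this is the cause.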

With the maximum (253248) that far from the 75th percentile (194), the box will shrink to a very thin line, because the y-axis has to stretch to include the most distant outliers. You might want to restrict the y-axis limits so the main box is easier to see.
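As a rough check of how thin the box gets, the arithmetic can be done directly from the describe() output in the question. This sketch assumes matplotlib's default whisker factor of 1.5 times the interquartile range; the variable names are only illustrative.

```python
# Quartiles, minimum and maximum copied from the describe() output in the question.
q1, q3 = 70.0, 194.0
minimum, maximum = 1.0, 253248.0

iqr = q3 - q1                          # height of the box itself
upper_whisker_limit = q3 + 1.5 * iqr   # values above this are drawn as outliers (whis=1.5)
box_share = iqr / (maximum - minimum)  # fraction of the full data range covered by the box

print("IQR:", iqr)
print("upper whisker limit:", upper_whisker_limit)
print(f"box covers {box_share:.2%} of the data range")
```

The box spans roughly 0.05% of the full y range, which is well under one pixel on a typical figure.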

The following example code tries to mimic the situation:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

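# Synthetic, heavily right-skewed counts that mimic the data in the question
df = pd.DataFrame({"ip_count": np.round(((np.random.rand(100_000) ** 3) + 1) ** 19)})
# Set the last row to NaN, like the single missing value in the question's dataframe
df.iloc[-1, :] = np.nan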
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(12, 5))
ax1.boxplot(df["ip_count"].dropna())
ax1.set_title('default ylim, full range')
ax2.boxplot(df["ip_count"].dropna())
ax2.set_ylim(np.percentile(df["ip_count"].dropna(), [0, 80]))
ax2.set_title('ylim from 0th to 80th percentile')

sns.boxplot(y=df["ip_count"], ax=ax3)
ax3.set_ylim(np.percentile(df["ip_count"].dropna(), [0, 80]))
ax3.set_title('seaborn with ylim\nfrom 0th to 80th percentile')

plt.tight_layout()
plt.show()
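Two alternatives to clipping the y limits, not part of the original answer, are a logarithmic y-axis and hiding the outlier points with showfliers=False. The sketch below uses synthetic data and a hypothetical output file name; both are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, heavily right-skewed data (an assumption, not the asker's data).
rng = np.random.default_rng(0)
values = np.round((rng.random(10_000) ** 3 + 1) ** 19)

fig, (ax_log, ax_nofliers) = plt.subplots(ncols=2, figsize=(8, 4))

# Option 1: keep every outlier but compress the axis logarithmically.
# This works here because all values are >= 1.
ax_log.boxplot(values)
ax_log.set_yscale("log")
ax_log.set_title("log-scaled y axis")

# Option 2: do not draw outlier points; autoscaling then follows the whiskers.
ax_nofliers.boxplot(values, showfliers=False)
ax_nofliers.set_title("showfliers=False")

plt.tight_layout()
fig.savefig("box_alternatives.png")  # hypothetical file name
plt.close(fig)
```

The log scale keeps the outliers visible, while showfliers=False hides them entirely, so the choice depends on whether the extreme IPs matter for the reader.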

[image: boxplot with extreme outliers]
