I have the following problem: I would like to draw three boxplots from three different datasets. My code:
import matplotlib.pyplot as plt
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.suptitle(f'Pocet requestu (podobne IP dohromady). 2021{koncovka}')
ax1.boxplot(data_count_G["ip_count"])
ax1.set_xlabel(f"Google bot, n = {len(data_count_G)}")
ax1.set_ylabel("requests from ip")
ax2.boxplot(data_count_S["ip_count"])
ax2.set_xlabel(f"Seznam bot, n = {len(data_count_S)}")
ax2.set_ylabel("requests from ip")
ax3.boxplot(data_count_nGS["ip_count"])
ax3.set_xlabel(f"Bez Google bota, n = {len(data_count_nGS)}")
ax3.set_ylabel("requests from ip")
plt.tight_layout()
plt.savefig('box_request_count_GSnG.png', bbox_inches='tight')
plt.close()
However, the result looks like this:
When I run data_count_nGS.info(), I get:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 92774 entries, 0 to 20899956
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   zacatek_ip  92773 non-null  object
 1   ip_count    92773 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.1 MB
When I run data_count_nGS.describe(), I get:
           ip_count
count  92773.000000
mean     209.073351
std     1430.188719
min        1.000000
25%       70.000000
50%      107.000000
75%      194.000000
max   253248.000000
Could the problem be the size of the last DataFrame (92774 rows)? How can I fix it?
CodePudding user response:
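You probably need to remove the NaN values, as in ax3.boxplot(data_count_nGS["ip_count"].dropna()). Your info() output shows 92774 entries but only 92773 non-null values in ip_count, so one value is missing. You could also try seaborn's boxplot, which removes NaNs automatically.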
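To illustrate why one missing value can matter: as far as I know, matplotlib derives the box statistics from NumPy's percentile function, which returns NaN as soon as the input contains a NaN. The sketch below uses a small made-up Series (`ip_count` is a hypothetical stand-in for data_count_nGS["ip_count"], not your real data) to show how to count the missing values and how dropna() changes the quartiles.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_count_nGS["ip_count"]: four counts and one NaN
ip_count = pd.Series([1.0, 70.0, 107.0, 194.0, np.nan])

# Count the missing values before plotting
n_missing = int(ip_count.isna().sum())

# np.percentile propagates NaN, so every quartile comes out as NaN
q_with_nan = np.percentile(ip_count, [25, 50, 75])

# After dropping the NaN, the quartiles are ordinary finite numbers
clean = ip_count.dropna()
q_clean = np.percentile(clean, [25, 50, 75])

print("missing values:", n_missing)
print("quartiles with NaN:", q_with_nan)
print("quartiles after dropna():", q_clean)
```

Running the same isna().sum() check on each of your three DataFrames should tell you which ones need dropna() before plotting.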
With the maximum (253248) that far from the 75th percentile (194), the box shrinks to a very thin line, because the y-axis has to stretch to cover the most distant outliers. You might want to restrict the y-axis limits to see the main box better.
The following example code tries to mimic the situation:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({"ip_count": np.round(((np.random.rand(100_000) ** 3) + 1) ** 19)})
df.iloc[-1, :] = np.nan
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(12, 5))
ax1.boxplot(df["ip_count"].dropna())
ax1.set_title('default ylim, full range')
ax2.boxplot(df["ip_count"].dropna())
ax2.set_ylim(np.percentile(df["ip_count"].dropna(), [0, 80]))
ax2.set_title('ylim from 0th to 80th percentile')
sns.boxplot(y=df["ip_count"], ax=ax3)
ax3.set_ylim(np.percentile(df["ip_count"].dropna(), [0, 80]))
ax3.set_title('seaborn with ylim\nfrom 0th to 80th percentile')
plt.tight_layout()
plt.show()
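If you would rather not clip the axis by hand, two other options may be worth trying: hiding the outlier markers with boxplot's showfliers=False, or switching the y-axis to a logarithmic scale with set_yscale("log"). This is only a sketch on synthetic heavy-tailed data that I made up for illustration (the seed, figure size and titles are arbitrary choices). The log scale assumes all counts are positive, which holds for your data since the minimum is 1.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic heavy-tailed counts with one NaN (made up for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({"ip_count": np.round((rng.random(100_000) ** 3 + 1) ** 19)})
df.iloc[-1, :] = np.nan

# Drop the NaN once and reuse the cleaned values
values = df["ip_count"].dropna()

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(8, 5))

# Option 1: do not draw the outliers, so the axis only spans the whiskers
ax1.boxplot(values, showfliers=False)
ax1.set_title("outliers hidden (showfliers=False)")

# Option 2: keep the outliers but compress the range with a log axis
ax2.boxplot(values)
ax2.set_yscale("log")
ax2.set_title("logarithmic y-axis")

plt.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards, as in your own code. Note that hiding the fliers removes information from the plot, so the log scale is probably the better choice if the outliers themselves are of interest.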