import pandas as pd

Xi = pd.DataFrame([("Guyana", 5.78, 6.89), ("Paraguay", 7.29, 8.5), ("Ecuador", 9.35, 10.92), ("Peru", 9.96, 11.55), ("Kolumbien", 10.9, 12.71),
("Costa Rica", 13.0, 14.27), ("Brasilien", 14.4, 15.23), ("Venezuela", 16.56, 16.77), ("Argentina", 18.71, 18.8), ("Chile", 19.36, 21.92)], columns=["country", "GDP/L 2010", "GDP/L 2014"])
Xi.describe()
So obviously the 75% quantile sits at position (N + 1)*3/4, which gives us Xi with i = 8.25. So the 75% quantile for 2010 equals 17.635, and not 16.02 as the describe() method outputs. Why is that?
CodePudding user response:
I've been looking into it, and for some reason it seems that both describe() and quantile(0.75) return the wrong value for the 75th percentile. What is more, they seem to return the value the 75th percentile would have if the last row did not exist (as if these methods were not taking the last row into account).
CodePudding user response:
I think your interpretation of the quantile is incorrect.
Starting from this sorted Series:
s = Xi['GDP/L 2010']
print(s)
0 5.78
1 7.29
2 9.35
3 9.96
4 10.90
5 13.00
6 14.40
7 16.56
8 18.71
9 19.36
The 0.75 quantile is taken from the value at the 0.75 position if there is one, or otherwise computed from the values just before and after it.
Here we have 10 elements, and Python starts counting from zero, so the wanted index (assuming sorted values!) is:
(len(Xi)-1)*3/4 # 6.75
There is no index 6.75, so there are several ways to get the quantile:
# closest index
# closest is 7 so the quantile is
16.56
# mid-point: (14.40 + 16.56)/2
15.48
# linear interpolation
# as the index is 6.75, weighted mean of 14.40 and 16.56: 14.40 + 0.75*(16.56 - 14.40)
16.020
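The three options above can be reproduced by hand. Here is a minimal sketch in plain Python, assuming the values are already sorted. The helper name quantile_by_hand is made up for illustration and is not part of pandas, and its tie rule for 'nearest' is a simplification:

```python
import math

# 'GDP/L 2010' values from the question, already sorted
values = [5.78, 7.29, 9.35, 9.96, 10.90, 13.00, 14.40, 16.56, 18.71, 19.36]


def quantile_by_hand(sorted_values, q, interpolation="linear"):
    """Hypothetical helper mirroring the three options described above."""
    pos = (len(sorted_values) - 1) * q   # zero-based fractional index (6.75 here)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo                      # how far pos lies past the lower index
    lower, upper = sorted_values[lo], sorted_values[hi]

    if interpolation == "linear":
        return lower + frac * (upper - lower)
    if interpolation == "midpoint":
        return (lower + upper) / 2
    if interpolation == "nearest":
        # Assumption: an exact tie (frac == 0.5) goes to the lower value here;
        # a library may resolve ties differently.
        return upper if frac > 0.5 else lower
    raise ValueError(f"unknown interpolation: {interpolation}")


for kind in ("linear", "nearest", "midpoint"):
    print(kind, round(quantile_by_hand(values, 0.75, kind), 4))
```
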
By default, quantile uses linear interpolation. You can change this with the interpolation parameter:
s = Xi['GDP/L 2010']
s.quantile(0.75, interpolation='linear')
# 16.02
s.quantile(0.75, interpolation='nearest')
# 16.56
s.quantile(0.75, interpolation='midpoint')
# 15.48
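For comparison, the (N + 1)*3/4 rule from the question counts positions from one, which is a different convention from the zero-based (N - 1)*q index used here. The sketch below, in plain Python with illustrative variable names, works through that rule. The 17.635 quoted in the question appears to be the midpoint of the 8th and 9th values:

```python
# 'GDP/L 2010' values from the question, already sorted
values = [5.78, 7.29, 9.35, 9.96, 10.90, 13.00, 14.40, 16.56, 18.71, 19.36]
n = len(values)
q = 0.75

# Convention from the question: one-based position (N + 1) * q
pos_question = (n + 1) * q                 # 8.25
k = int(pos_question)                      # 8, i.e. the 8th value
below, above = values[k - 1], values[k]    # 8th and 9th values: 16.56 and 18.71
question_midpoint = (below + above) / 2
question_linear = below + (pos_question - k) * (above - below)

# Convention described in this answer: zero-based index (N - 1) * q
pos_answer = (n - 1) * q                   # 6.75
j = int(pos_answer)                        # 6
answer_linear = values[j] + (pos_answer - j) * (values[j + 1] - values[j])

print("(N + 1)*q position:", pos_question)
print("  midpoint of neighbours:", round(question_midpoint, 4))
print("  linear interpolation:  ", round(question_linear, 4))
print("(N - 1)*q index:", pos_answer)
print("  linear interpolation:  ", round(answer_linear, 4))
```

Both conventions are valid quantile definitions; they simply place the 75% point between different pairs of neighbouring values.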