This is an example dataframe; my actual dataframe has hundreds more rows.
nums_1 nums_2 nums_3
1 1 8
2 1 7
3 5 9
Is there a method that will calculate the 95% confidence interval across each row, and that would also work for a large dataframe?
df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3' : [8,7,9]})
CodePudding user response:
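You can use stats.norm.interval to find an interval at the 95% level, with numpy.mean and numpy.std of the values in each row, like below. Note that the scale here is the row's standard deviation, so the interval describes the spread of the values rather than the uncertainty of the row mean: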
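import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})
# For each row, take the central 95% of a normal distribution with that row's mean and standard deviation
df['95_interval'] = df.apply(
    lambda row: stats.norm.interval(0.95, loc=np.mean(row), scale=np.std(row)),
    axis=1)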
Output:
>>> df
nums_1 nums_2 nums_3 95_interval
0 1 1 8 (-3.134217846965163, 9.80088451363183)
1 2 1 7 (-1.8109239490159825, 8.477590615682649)
2 3 5 9 (0.7776575196232134, 10.55567581371012)
CodePudding user response:
You can use:
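import numpy as np
from scipy import stats
# t-based interval for the mean of each row: len(x)-1 degrees of freedom, standard error as scale
df.apply(lambda x: stats.t.interval(0.95, len(x)-1, loc=np.mean(x), scale=stats.sem(x)), axis=1)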
You will obtain essentially the same results by using the following:
import statsmodels.stats.api as sms
df.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)
Both approaches return the same kind of result: a tuple (lower, upper) for each row.
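If two numeric columns are more convenient than a column of tuples, the tuples can be unpacked into a new dataframe. The following is only a sketch; the column names ci_lower and ci_upper are made up for illustration and are not part of either library.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})

# Series of (lower, upper) tuples, one per row, as in the answer above
ci = df.apply(
    lambda x: stats.t.interval(0.95, len(x) - 1, loc=np.mean(x), scale=stats.sem(x)),
    axis=1)

# Unpack the tuples into two float columns (hypothetical names), keeping the original index
bounds = pd.DataFrame(ci.tolist(), index=df.index, columns=['ci_lower', 'ci_upper'])
```

bounds can then be joined back to the original data with df.join(bounds).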
This approach is described in "Compute a confidence interval from sample data". It is important to understand that it works correctly only if each row (each sample) is drawn independently from a normal distribution with an unknown standard deviation.
When it comes to large dataframes, the easy solution is to use swifter. However, it only makes the calculation about twice as fast. Nevertheless, it is worth trying: https://towardsdatascience.com/do-you-use-apply-in-pandas-there-is-a-600x-faster-way-d2497facfa66
import statsmodels.stats.api as sms
import swifter
df.swifter.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)
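For very large dataframes, another option is to avoid apply altogether and compute the same t-based interval with vectorized NumPy operations. This is a minimal sketch, assuming every row has the same number of values and no missing data; the function name row_t_intervals and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})

def row_t_intervals(frame, confidence=0.95):
    """Row-wise t confidence interval for the mean, computed without apply."""
    values = frame.to_numpy(dtype=float)
    n = values.shape[1]                                  # values per row
    means = values.mean(axis=1)
    # Standard error of the mean with the sample standard deviation (ddof=1),
    # which matches the default of scipy.stats.sem
    sems = values.std(axis=1, ddof=1) / np.sqrt(n)
    # Two-sided critical value of the t distribution with n-1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    half_width = t_crit * sems
    return pd.DataFrame(
        {'ci_lower': means - half_width, 'ci_upper': means + half_width},
        index=frame.index)

intervals = row_t_intervals(df)
```

Because the critical value is computed once and the rest is array arithmetic, this should scale better than calling a SciPy function once per row, though the actual gain depends on the data.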