I have a DataFrame that, when simplified, looks something like:
ID | X | Y |
---|---|---|
2 | 0 | 0 |
2 | 1 | 3.3 |
2 | 4 | 9 |
2 | 6 | 12 |
7 | 0 | 0.2 |
7 | 2 | 1 |
7 | 3 | 6 |
7 | 5 | 7 |
7 | 7 | 9 |
However, there are more columns (19 total) with other info, and there are thousands of rows, with 70 unique 'ID' numbers.
What I want is to, for each ID 'group', calculate the best-fit line between X and Y, and store that information as something like:
ID | Slope | Intercept | R2 |
---|---|---|---|
2 | 1.95 | 0.70 | 0.99 |
7 | 1.34 | 0.08 | 0.88 |
which summarizes this best-fit information for each 'ID'. Here Slope and Intercept are the slope and intercept of the linear best-fit line, and R2 is the R-squared value for some measure of error.
I can of course do this manually using something like:
import scipy
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(a, b)
r2 = r_value**2
which is how I filled those in in the second table. But how could I automate this for all the rows within in each 'ID' group? I imagine using df.groupby('ID') somehow, but I don't know how to combine both methods at a large scale...
Any help welcome, thanks!
CodePudding user response:
You can group by ID
and then use apply
:
res = df.groupby('ID').apply(lambda x: scipy.stats.linregress(x['X'], x['Y']))
stats_df = pd.DataFrame(res.tolist(), index=res.index).reset_index().round(2)
Output:
>>> stats_df
ID slope intercept rvalue pvalue stderr
0 2 1.95 0.70 0.99 0.01 0.17
1 7 1.34 0.08 0.94 0.02 0.28