How to find best-fit line between two columns in Pandas Dataframe, but split up by each individual g-CodePudding

I have a DataFrame that, when simplified, looks something like:

ID	X	Y
2	0	0
2	1	3.3
2	4	9
2	6	12
7	0	0.2
7	2	1
7	3	6
7	5	7
7	7	9

However, there are more columns (19 total) with other info, and there are thousands of rows, with 70 unique 'ID' numbers.

What I want is to, for each ID 'group', calculate the best-fit line between X and Y, and store that information as something like:

ID	Slope	Intercept	R2
2	1.95	0.70	0.99
7	1.34	0.08	0.88

which summarizes this best-fit information for each 'ID'. Here Slope and Intercept are the slope and intercept of the linear best-fit line, and R2 is the R-squared value for some measure of error.

I can of course do this manually using something like:

import scipy
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(a, b)
r2 = r_value**2

which is how I filled those in in the second table. But how could I automate this for all the rows within in each 'ID' group? I imagine using df.groupby('ID') somehow, but I don't know how to combine both methods at a large scale...

Any help welcome, thanks!

CodePudding user response：

You can group by ID and then use apply:

res = df.groupby('ID').apply(lambda x: scipy.stats.linregress(x['X'], x['Y']))
stats_df = pd.DataFrame(res.tolist(), index=res.index).reset_index().round(2)

Output:

>>> stats_df
   ID  slope  intercept  rvalue  pvalue  stderr
0   2   1.95       0.70    0.99    0.01    0.17
1   7   1.34       0.08    0.94    0.02    0.28