Consider every possible pair, remove NaN entries and obtain the linear regression-CodePudding

I'm working with Python for the first time so I've few difficulties.

I've a csv file of data with 6 columns and 20 rows, some entries are 'NaN'. What I'm trying to do is an interaction that would consider the first column vs the others (A vs B,C,D,E and F), then the second column with all the others (B vs C,D,E and F) etc (i.e compare every possible pair of the 6 columns), for each pair remove the NaN entries (so the number of rows for each final pair would be different) and calculate the linear regression.

I can obtain the results I'm looking for by considering each pair "manually", but since I've quite a lot of DataFrames for which I've to do this procedure, it'd take a lot of time, so I'm hoping someone would help me with a faster method.

Part of my data:

| A      | B      | C       | D    | E             | F            |

| 70.385 | 1316.0 | NaN     | 1.84 | 1.021059e 37  | 1.284026e 41 |

| 13.183 | 800.0  | 11549.0 | 1.66 | 4.710032e 35  | Nan          |

| 9.750  | NaN    | NaN     | 1.55 |  1.437108e 36 | 5.070657e 40 |

| 12.302 | NaN    | 547.7   | 1.56 | 2.149507e 36  | 2.294859e 40 |

| NaN    | 2784.2 | 29984.4 | 1.87 | NaN           | 2.294859e 40 |

what I'm able to do is:

import pandas as pd
from scipy.stats import linregress

df=data1.dropna(subset=['A','B']) 
lr_AB=linregress(df['A'],df['B'])

and what I'm trying to obtain is: LinregressResult(slope= , intercept=, rvalue=, pvalue=, stderr=) for each pair. So, how do I iterate these two commands for every possible pair of columns?

My attempt was this, but without much results:

for i in range(len(data1.columns)-1):
    if data1.iloc[:, :] is 'NaN':
        df= data1.dropna()
        print(df)
        lr= linregress(df.iloc[:, i], df.iloc[:, i 1])
    else:
        print('no nan') 
        lr1= linregress(data1.iloc[:, i], data1.iloc[:, i 1])

Thank you so much.

CodePudding user response：

Assuming you'd like to use scipy.stats's linregress, here is a way to do this. Not much more efficient than your code, as it also loops over the combinations of columns:

from scipy.stats import linregress
from itertools import combinations

for a, b in combinations(df, 2):
    ab = df[[a, b]].dropna()
    lr[(a, b)] = dict(zip(
        'slope intercept r_value p_value std_err'.split(),
        linregress(ab[a], ab[b])
    ))
    lr[(a, b)]['n_points'] = ab.shape[0]

result = pd.DataFrame(lr).T

Sample run on your example data -

>>> result
            slope     intercept   r_value   p_value       std_err  n_points
A B  9.020664e 00  6.810806e 02  1.000000  0.000000  0.000000e 00       2.0
  C  1.248729e 04 -1.530709e 05  1.000000  0.000000  0.000000e 00       2.0
  D  4.316155e-03  1.538532e 00  0.942012  0.057988  1.087230e-03       4.0
  E  1.503028e 35 -4.016931e 35  0.984424  0.015576  1.898085e 34       4.0
B C  9.291100e 00  4.116120e 03  1.000000  0.000000  0.000000e 00       2.0
  D  8.706011e-05  1.647796e 00  0.789115  0.421079  6.776741e-05       3.0
  E  1.887517e 34 -1.462913e 37  1.000000  0.000000  0.000000e 00       2.0
C D  1.061987e-05  1.547701e 00  0.998360  0.036468  6.090086e-07       3.0
  E -1.525732e 32  2.233071e 36 -1.000000  0.000000  0.000000e 00       2.0
D E  2.896150e 37 -4.429182e 37  0.869114  0.130886  1.165450e 37       4.0
A F  1.525539e 39  2.034718e 40  0.957112  0.187122  4.617784e 38       3.0
B F -7.182537e 37  2.229248e 41 -1.000000  0.000000  0.000000e 00       2.0
C F  0.000000e 00  2.294859e 40  0.000000  1.000000  0.000000e 00       2.0
D F  1.097658e 41 -1.308991e 41  0.382452  0.617548  1.875151e 41       4.0
E F  1.061306e 04  1.854240e 40  0.946081  0.210009  3.633822e 03       3.0