How to loop over specific ids in a csv file?

I have a csv file:

ids    year    mean
1      2000    200
2      2000    199
3      2000    193
4      2000    189
1      2001    205
2      2001    197
3      2001    197
4      2001    196
.
.
.
4      2016    212

I would like to loop over each individual id, calculate the Pearson correlation coefficient for each of them, and put the results in a list. How can I do that?

I tried something that took forever and never worked:

import pandas as pd
import numpy as np
import scipy.stats as stats

path = 'C:/path/'
#%%
df = pd.read_csv(path + 'mycsvfile.csv')

res = []
for i in range(df['id'].min(), df['id'].max()):
    x = stats.pearsonr(df['year'], df['mean'])
    res.append(x)

df = pd.DataFrame(res)

CodePudding user response：

Note that in

for i in range(df['id'].min(), df['id'].max()):
    x = stats.pearsonr(df['year'], df['mean'])
    res.append(x)

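you have i, which is never used in the loop body, so you in fact compute the very same thing (the correlation over the whole DataFrame) again and again. Note also that range stops before the maximum id, so the last id would be skipped. What you need is groupby; consider the following simple example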

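import pandas as pd
import scipy.stats as stats

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'x': [1, 2, 3, 4, 5, 6], 'y': [1, 2, 4, 3, 5, 6]})
# Compute Pearson's r (and its p-value) separately for each id
out = df.groupby('id').apply(lambda data: stats.pearsonr(data['x'], data['y']))
print(out)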

output

id
1     (1.0, 1.0)
2    (-1.0, 1.0)
3     (1.0, 1.0)
dtype: object

Explanation: group the rows by id, then compute Pearson's r for each group.
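
If you would rather keep an explicit loop, as in the question, a minimal sketch could iterate over the groups directly and collect one result per id. The sample values and the names id_, group, r and p are for illustration only; the column names follow the question's file.

```python
import pandas as pd
from scipy import stats

# Illustrative sample in the question's column layout (not the real file)
df = pd.DataFrame({
    'ids':  [1, 2, 1, 2, 1, 2],
    'year': [2000, 2000, 2001, 2001, 2002, 2002],
    'mean': [200, 199, 205, 197, 206, 198],
})

res = []
# groupby yields (key, sub-DataFrame) pairs, one per distinct id
for id_, group in df.groupby('ids'):
    # Correlate year and mean using only this id's rows
    r, p = stats.pearsonr(group['year'], group['mean'])
    res.append({'ids': id_, 'r': r, 'p_value': p})

# One row per id, with the coefficient and its p-value
res = pd.DataFrame(res)
print(res)
```

Unlike the loop in the question, each iteration here only sees the rows that belong to one id.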

CodePudding user response：

Using groupby and apply should simplify your code.

csv = """ids    year    mean
1      2000    200
2      2000    199
3      2000    193
4      2000    189
1      2001    205
2      2001    197
3      2001    197
4      2001    196
1      2002    206
2      2002    198
3      2002    199
4      2002    200
4      2016    212
"""
from io import StringIO
import pandas as pd
import scipy.stats as stats

df = pd.read_csv(StringIO(csv), sep=r'\s+')
res = df.groupby('ids').apply(lambda d: pd.Series(stats.pearsonr(d['year'], d['mean'])))
res.columns = ['r', 'p_value']
res

outputs:

            r   p_value
ids                    
1    0.933257  0.233908
2   -0.500000  0.666667
3    0.981981  0.121038
4    0.927045  0.072955

But this still loops internally (because of apply). If you don't need the p-value from stats.pearsonr, you can use

df.groupby('ids').corr().unstack().iloc[:, 2].to_frame('r').reset_index()
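
The positional .iloc[:, 2] relies on the column order that unstack produces. A label-based variant, sketched here on an illustrative sample with the same column names, picks the year/mean correlation by name instead; the variable names corr and r are assumptions for the example.

```python
import pandas as pd

# Illustrative sample in the question's column layout (not the real file)
df = pd.DataFrame({
    'ids':  [1, 2, 1, 2, 1, 2],
    'year': [2000, 2000, 2001, 2001, 2002, 2002],
    'mean': [200, 199, 205, 197, 206, 198],
})

# Per-id correlation matrices, indexed by (ids, column name)
corr = df.groupby('ids')[['year', 'mean']].corr()

# Take the 'year' row of each matrix, then its 'mean' column:
# that entry is the correlation between year and mean for that id
r = corr.xs('year', level=1)['mean'].rename('r').reset_index()
print(r)
```

Selecting by label keeps the result correct even if the column order changes.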