I have a csv file:
ids year mean
1 2000 200
2 2000 199
3 2000 193
4 2000 189
1 2001 205
2 2001 197
3 2001 197
4 2001 196
.
.
.
4 2016 212
I would like to loop over each individual id
to calculate the Pearson correlation coefficient for each of them and put the results in a list.
How can I do that?
I tried something that took forever and never worked:
import pandas as pd
import numpy as np
import scipy.stats as stats
path = 'C:/path/'
#%%
df = pd.read_csv(path + 'mycsvfile.csv')
res = []
for i in range(df['id'].min(), df['id'].max()):
    x = stats.pearsonr(df['year'], df['mean'])
    res.append(x)
df = pd.DataFrame(res)
CodePudding user response:
Note that in
for i in range(df['id'].min(), df['id'].max()):
    x = stats.pearsonr(df['year'], df['mean'])
    res.append(x)
the loop variable i is never used in the loop body, so you are in fact computing the very same thing (the correlation over the whole DataFrame) again and again.
What you need is groupby. Consider the following simple example:
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'id':[1,1,2,2,3,3],'x':[1,2,3,4,5,6],'y':[1,2,4,3,5,6]})
# compute Pearson's r (and its p-value) separately for the rows of each id
out = df.groupby('id').apply(lambda data:stats.pearsonr(data['x'],data['y']))
print(out)
output
id
1 (1.0, 1.0)
2 (-1.0, 1.0)
3 (1.0, 1.0)
dtype: object
Explanation: group the rows by id, then use apply to compute Pearson's r for each group.
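Applied to the column names from the question (ids, year, mean), the same idea can be sketched as an explicit loop over the groups. This is only a minimal sketch: the inline sample is a handful of rows in the question's format standing in for the real file, and the names sample, results and r_values are made up for illustration.

```python
from io import StringIO

import pandas as pd
from scipy import stats

# A few rows in the question's format; the real CSV would be read from disk instead
sample = """ids year mean
1 2000 200
2 2000 199
3 2000 193
4 2000 189
1 2001 205
2 2001 197
3 2001 197
4 2001 196
1 2002 206
2 2002 198
3 2002 199
4 2002 200
"""
df = pd.read_csv(StringIO(sample), sep=r"\s+")

# Iterating over the groupby object yields (id, sub-DataFrame) pairs,
# so each correlation only sees the rows that belong to that id
results = {}
for id_value, group in df.groupby("ids"):
    r, p_value = stats.pearsonr(group["year"], group["mean"])
    results[int(id_value)] = (r, p_value)

# Plain list of r values, ordered by id
r_values = [results[key][0] for key in sorted(results)]
print(r_values)
```

Unlike the loop in the question, the loop variable here is the group itself, so every iteration works on a different subset of rows.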
CodePudding user response:
Using groupby and apply should simplify your code.
csv = """ids year mean
1 2000 200
2 2000 199
3 2000 193
4 2000 189
1 2001 205
2 2001 197
3 2001 197
4 2001 196
1 2002 206
2 2002 198
3 2002 199
4 2002 200
4 2016 212
"""
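from io import StringIO
import pandas as pd
import scipy.stats as stats
# the sample is whitespace-separated, so split on one or more whitespace characters
df = pd.read_csv(StringIO(csv), sep=r'\s+')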
res = df.groupby('ids').apply(lambda d: pd.Series(stats.pearsonr(d['year'], d['mean'])))
res.columns = ['r', 'p_value']
res
outputs:
r p_value
ids
1 0.933257 0.233908
2 -0.500000 0.666667
3 0.981981 0.121038
4 0.927045 0.072955
But this still loops over the groups internally (because of apply). If you don't need the p-value
from stats.pearsonr, you can use
df.groupby('ids').corr().unstack().iloc[:, 2].to_frame('r').reset_index()
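As a sanity check, one could compare the groupby().corr() route with stats.pearsonr on a small sample. The sketch below assumes a few rows in the question's format and picks the year/mean correlation by label rather than by position; the names sample, corr_wide, r_corr and r_pearson are illustrative only.

```python
from io import StringIO

import numpy as np
import pandas as pd
from scipy import stats

# A few rows in the question's format, standing in for the real file
sample = """ids year mean
1 2000 200
2 2000 199
3 2000 193
4 2000 189
1 2001 205
2 2001 197
3 2001 197
4 2001 196
1 2002 206
2 2002 198
3 2002 199
4 2002 200
"""
df = pd.read_csv(StringIO(sample), sep=r"\s+")

# One correlation matrix per id, flattened to one row per id
corr_wide = df.groupby("ids")[["year", "mean"]].corr().unstack()

# Select the year-vs-mean entry by its column label
r_corr = corr_wide[("year", "mean")]

# Reference values computed with pearsonr, one group at a time
r_pearson = pd.Series(
    {int(key): stats.pearsonr(group["year"], group["mean"])[0]
     for key, group in df.groupby("ids")}
)

print(np.allclose(r_corr.to_numpy(), r_pearson.to_numpy()))
```

Selecting the column by label avoids depending on the column order that iloc[:, 2] relies on.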