Let's say I have the following data:
day | query | num_searches |
---|---|---|
1 | abc | 2 |
1 | def | 3 |
2 | abc | 6 |
3 | abc | 5 |
4 | def | 1 |
4 | abc | 3 |
5 | abc | 7 |
6 | abc | 8 |
7 | abc | 10 |
8 | abc | 1 |
I'd like to generate z-score (excluding the current row's value) such that for query 'abc':
Day 1: [6, 5, 3, 7, 8, 10, 1] (exclude the value 2) zscore = -1.32
Day 2: [2, 5, 3, 7, 8, 10, 1] (exclude the value 6) zscore = 0.28
...
Day 7: [2, 6, 5, 3, 7, 8, 1] (exclude the value 10) zscore = 2.22
Day 8: [2, 6, 5, 3, 7, 8, 10] (exclude the value 1) zscore = -1.88
I have the following function to calculate this 'exclusive' zscore.
def zscore_exclusive(arr):
newl = []
for index, val in enumerate(x):
l = list(x)
val = l.pop(index)
arr_popped = np.array(l)
avg = np.mean(arr_popped)
stdev = np.std(arr_popped)
newl.append((val - avg) / stdev)
return np.array(newl)
How can I apply this custom function to each grouping (by query string)? Remember, I'd like to pop the currently evaluated element from the series.
CodePudding user response:
Given:
day query num_searches
0 1 abc 2
1 1 def 3
2 2 abc 6
3 3 abc 5
4 4 def 1
5 4 abc 3
6 5 abc 7
7 6 abc 8
8 7 abc 10
9 8 abc 1
Doing:
- Note!
- For
np.std
,ddof = 0
by default. - But for
pd.Series.std
,ddof = 1
by default.- You should be sure which one you want to use.
- For
z_score = lambda x: [(x[i]-x.drop(i).mean())/x.drop(i).std(ddof=0) for i in x.index]
df['z-score'] = df.groupby('query')['num_searches'].transform(z_score)
print(df)
Output:
day query num_searches z-score
0 1 abc 2 -1.319950
1 1 def 3 inf
2 2 abc 6 0.277350
3 3 abc 5 -0.092057
4 4 def 1 -inf
5 4 abc 3 -0.866025
6 5 abc 7 0.661438
7 6 abc 8 1.083862
8 7 abc 10 2.223782
9 8 abc 1 -1.877336