With sample data and code below, I'm trying to groupby year-month and find top K columns with smallest std values inside all the columns endswith _values
:
import pandas as pd
import numpy as np
from statistics import stdev
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
k = 3 # set k as 3
value_cols = df.columns[df.columns.str.endswith('_values')]
def find_topK_smallest_std(group):
std = stdev(group[value_cols])
cols = std.nsmallest(k).index
out_cols = [f'std_{i 1}' for i in range(k)]
rv = group.loc[:, cols]
rv.columns = out_cols
return rv
df.groupby(pd.Grouper(freq='M'), dropna=False).apply(find_topK_smallest_std)
But it raises a type error, how could I fix this issue? Sincere thanks at advance.
Out:
TypeError: can't convert type 'str' to numerator/denominator
Reference link:
Groupby year-month and find top N smallest values columns in Python
CodePudding user response:
In your solution add DataFrame.apply
for stdev
per columns, if need per rows add axis=1
:
def find_topK_smallest_std(group):
#procssing per columns
std = group[value_cols].apply(stdev)
cols = std.nsmallest(k).index
out_cols = [f'std_{i 1}' for i in range(k)]
rv = group.loc[:, cols]
rv.columns = out_cols
return rv
df = df.groupby(pd.Grouper(freq='M'), dropna=False).apply(find_topK_smallest_std)
print (df)
std_1 std_2 std_3
2013-02-26 7.333694 3.126731 1.389472
2013-02-27 7.529254 7.843101 6.621605
2013-02-28 6.165574 5.612724 0.866300
2013-03-01 5.693051 3.711608 4.521452
2013-03-02 7.322250 4.763135 5.178144
... ... ...
2013-05-22 8.795736 3.864723 6.316478
2013-05-23 7.959282 5.140268 1.839659
2013-05-24 5.412016 5.890717 9.081583
2013-05-25 1.088414 1.610210 9.016004
2013-05-26 4.930571 6.893207 2.338785
[90 rows x 3 columns]