So I have 2 pandas dataframes with their dtypes. I want to apply a convex combination to the non-categorical features between all rows that have the same categorical features, never pairing a row with itself. (Given some values x1 and x2, a convex combination is L*x1 + (1-L)*x2 with L in [0,1].) There also shouldn't be any duplicates, i.e. no pair of rows should be combined more than once. So for example:
Is taco? Count
yes 2
yes 5
yes 1
Where Is taco? is dtype category and Count is dtype Int. x1 and x2 can be vectors of numerical features, but in the above case they are just 2 different rows of Count. There is only one categorical feature above, Is taco?, and its values are all the same, so we do the convex combination between all rows. If L=0.5 it should return:
idx Is taco? Count
0 yes 3.5
1 yes 1.5
2 yes 3
idx=0 was calculated from the 1st and 2nd rows: 0.5 * 2 + 0.5 * 5 = 3.5. Then idx=1 was calculated from the 1st and 3rd rows: 0.5 * (2 + 1) = 1.5, and idx=2 from the 2nd and 3rd rows: 0.5 * (5 + 1) = 3. So as you can see, the non-categorical features are combined via a convex combination. How can I do this with Pandas?
CodePudding user response:
Use itertools.combinations:
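import numpy as np
from itertools import combinations

# Sum every combination of len(x)-1 values in a group and weight it by 0.5.
# For a group of 3 rows these are exactly the pairs, i.e. L=0.5.
func = lambda x: np.sum(np.array(list(combinations(x, r=len(x)-1))) * 0.5, axis=1)
out = df.groupby('Is taco?')['Count'] \
        .apply(func).explode().reset_index()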
Output:
>>> out
Is taco? Count
0 yes 3.5
1 yes 1.5
2 yes 3.0
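If the weight L should be something other than 0.5, or groups can have more or fewer than 3 rows, a pairwise variant may be closer to what the question describes. The following is only a sketch under those assumptions: the helper name pairwise_convex and its weight parameter are made up for illustration and are not part of pandas. It relies on combinations(values, 2), which yields each unordered pair once and never pairs a row with itself.

```python
from itertools import combinations

import pandas as pd


def pairwise_convex(df, key_col, value_col, weight=0.5):
    """Combine every unordered pair of rows within each group exactly once."""
    rows = []
    for key, group in df.groupby(key_col, observed=True):
        # combinations(..., 2) never pairs a row with itself and never repeats a pair
        for a, b in combinations(group[value_col], 2):
            rows.append({key_col: key, value_col: weight * a + (1 - weight) * b})
    return pd.DataFrame(rows, columns=[key_col, value_col])


df = pd.DataFrame({"Is taco?": ["yes", "yes", "yes"], "Count": [2, 5, 1]})
df["Is taco?"] = df["Is taco?"].astype("category")

out = pairwise_convex(df, "Is taco?", "Count", weight=0.5)
print(out)
```

With weight=0.5 on the three rows from the question this gives 3.5, 1.5 and 3.0, matching the expected result. Note that for L other than 0.5 the order within a pair matters: the earlier row gets weight L and the later row gets 1-L.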
Another example:
df = pd.DataFrame({'Is taco?': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
'Count': [1, 3, 5, 3, 6, 9, 12]})
print(df)
# Output:
Is taco? Count
0 no 1
1 no 3
2 no 5
3 yes 3
4 yes 6
5 yes 9
6 yes 12
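# After combinations (the 'yes' group has 4 rows, so r = len(x) - 1 = 3
# and each 'yes' value is 0.5 * the sum of 3 rows, not of a pair)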
>>> out
Is taco? Count
0 no 2.0
1 no 3.0
2 no 4.0
3 yes 9.0
4 yes 10.5
5 yes 12.0
6 yes 13.5
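The question also mentions that x1 and x2 can be vectors of numerical features and that there may be several categorical columns. A possible sketch for that case follows, assuming every column of dtype category is a grouping key and every other column is numeric. The function name convex_pairs and the Price column are hypothetical and only there for illustration.

```python
from itertools import combinations

import pandas as pd


def convex_pairs(df, weight=0.5):
    """Pairwise convex combination of all numeric columns within each category group."""
    cat_cols = list(df.select_dtypes(include="category").columns)
    num_cols = [c for c in df.columns if c not in cat_cols]
    pieces = []
    for key, group in df.groupby(cat_cols, observed=True):
        key = key if isinstance(key, tuple) else (key,)  # older pandas may return a scalar
        values = group[num_cols].to_numpy(dtype=float)
        pairs = list(combinations(range(len(values)), 2))
        if not pairs:
            continue  # a group with a single row has nothing to combine
        i, j = map(list, zip(*pairs))
        # Row-wise: weight * x1 + (1 - weight) * x2 for every pair (x1, x2)
        piece = pd.DataFrame(weight * values[i] + (1 - weight) * values[j], columns=num_cols)
        for col, val in zip(cat_cols, key):
            piece[col] = val
        pieces.append(piece)
    if not pieces:
        return pd.DataFrame(columns=cat_cols + num_cols)
    return pd.concat(pieces, ignore_index=True)[cat_cols + num_cols]


df = pd.DataFrame({"Is taco?": ["no", "no", "no", "yes", "yes", "yes", "yes"],
                   "Count": [1, 3, 5, 3, 6, 9, 12],
                   "Price": [10.0, 20.0, 30.0, 1.0, 2.0, 3.0, 4.0]})  # Price is made up
df["Is taco?"] = df["Is taco?"].astype("category")

out = convex_pairs(df, weight=0.5)
print(out)
```

A group of n rows produces n*(n-1)/2 output rows here, so the 'no' group yields 3 rows and the 'yes' group yields 6. The categorical columns come back as plain object columns in this sketch; cast them with astype("category") again if the dtype matters.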