I'm fairly new to pandas and python. I'm trying to return few selected interaction terms of all possible interactions in a data frame, and then return them as new features in the df.
My solution was to calculate the interactions of interest using sklearn's PolynomialFeature()
and attach them to the df in a for loop. See example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1111)
a1 = np.random.randint(2, size = (5,3))
a2 = np.round(np.random.random((5,3)),2)
df = pd.DataFrame(np.concatenate([a1, a2], axis = 1), columns = ['a','b','c','d','e','f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]
for comb in combinations:
polynomizer = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[comb])
newcol_nam = polynomizer.get_feature_names(comb)[2]
newcol_val = polynomizer.transform(df[comb])[:,2]
df[newcol_nam] = newcol_val
df
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
Another solution would be to run
PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(df)
and then drop the interactions I'm not interested in. However, neither option is ideal in terms of performance and I'm wondering if there is a better solution.
CodePudding user response:
As commented, you can try:
df = df.join(pd.DataFrame({
f'{x} {y}': df[x]*df[y] for x,y in combinations
}))
Or simply:
for comb in combinations:
df[' '.join(comb)] = df[comb].prod(1)
Output:
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00