I would like to generate all 2-way (and possibly 3-way) "multiplicative" combinations (i.e., column1 x column2, column2 x column3, column1 x column3, column1 x column2 x column3, etc.) of a pandas dataframe with about 100 columns (and 500K rows). I plan to evaluate these combinations to identify high-performing 'feature interactions' in a predictive model.
Thus far, my attempts (at either repurposing existing stackoverflow suggestions or other online materials) to generate the combinations have not been successful. Here is my minimal example with a very simple dataframe and the code I am using:
df = pd.DataFrame({'age':['10','20'], 'height':['5', '6'], 'weight':['100','150']})
sample_list = df.columns.to_list()
output = list(combinations(sample_list, 2))
pd.DataFrame(df[x].prod(axis = 1) for x in output)
My code above raises "KeyError: ('age', 'height')". The final output should contain only the 2-way (and potentially 3-way) 'multiplicative' combinations (i.e., it should not include the original columns).
Can somebody please guide me to the solution? Also, how would one modify the code to generate all 3-way combinations? Any suggestions to optimize for limited RAM (~30GB) are also appreciated.
CodePudding user response:
Firstly, the values of your dataframe should be numeric (e.g. int), not strings.
Secondly, prod returns the product of values along a given axis. IIUC, you want the product of two columns:
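import pandas as pd
from itertools import combinations

df = pd.DataFrame({'age': [10, 20], 'height': [5, 6], 'weight': [100, 150]})
sample_list = df.columns.to_list()
output = list(combinations(sample_list, 2))
# Add one product column per pair of original columns
for x, y in output:
    df[f'{x}*{y}'] = df[x] * df[y]
print(df)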
Output:
   age  height  weight  age*height  age*weight  height*weight
0   10       5     100          50        1000            500
1   20       6     150         120        3000            900
Edit:
This solution can be generalized to products of any length; just replace your loop with:
from functools import reduce
# Each tup is a tuple of column names; multiply the matching columns together
for tup in output:
    df['*'.join(tup)] = reduce(lambda a, b: a * b, [df[i] for i in tup])
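For reference, the KeyError in the question comes from indexing with a tuple: df[('age', 'height')] looks for a single column labelled by that tuple, whereas a list selects several columns. The sketch below is one possible way to keep the original prod(axis=1) idea, assuming the string columns can be converted with pd.to_numeric; the names pairs and interactions are only illustrative.

```python
from itertools import combinations

import pandas as pd

# String-valued frame, as in the question.
df = pd.DataFrame({'age': ['10', '20'], 'height': ['5', '6'], 'weight': ['100', '150']})

# Convert every column to a numeric dtype before multiplying.
df = df.apply(pd.to_numeric)

pairs = list(combinations(df.columns, 2))

# df[list(cols)] selects the columns named in the tuple;
# prod(axis=1) then multiplies them row by row.
interactions = pd.concat(
    {'*'.join(cols): df[list(cols)].prod(axis=1) for cols in pairs},
    axis=1,
)
print(interactions)
```

Because the products are collected in a separate frame, the result holds only the interaction columns and not the original ones.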
CodePudding user response:
Consider a chained multiplication approach with reduce to build a list of Series to ultimately concat:
from functools import reduce
from itertools import combinations
import pandas as pd
df = pd.DataFrame({
    'age': [10, 20], 'height': [5, 6], 'weight': [100, 150]
})
def chain_mul(cols):
    # Multiply the named columns together and label the result "a*b*..."
    col_name = "*".join(cols)
    series_dict = df[list(cols)].to_dict('series')
    col_prod = reduce(lambda x, y: x.mul(y), series_dict.values())
    return pd.Series(col_prod, name=col_name)
# BUILD COLUMN COMBINATIONS (DUOS AND TRIOS)
sample_list = df.columns.to_list()
combns = (
    list(combinations(sample_list, 2)) +
    list(combinations(sample_list, 3))
)
# BUILD LIST OF COLUMN PRODUCTS
series_list = [chain_mul(cols) for cols in combns]
# HORIZONTAL JOIN
interactions_df = pd.concat(series_list, axis=1)
print(interactions_df)
#    age*height  age*weight  height*weight  age*height*weight
# 0          50        1000            500               5000
# 1         120        3000            900              18000
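On the RAM question: 100 columns give 4,950 pairs and 161,700 triples. At 8 bytes per value and 500K rows, that is roughly 19.8 GB for all pairs and roughly 646.8 GB for all triples, so the 3-way products cannot all be held in 30 GB at once. A rough sketch of one workaround is to generate the products in batches and evaluate each batch before building the next. The function interaction_batches, its batch_size parameter, and the float32 choice are assumptions for illustration, not part of the answers above.

```python
from itertools import combinations, islice

import numpy as np
import pandas as pd


def interaction_batches(df, order=2, batch_size=500, dtype=np.float32):
    """Yield DataFrames holding at most `batch_size` product columns each."""
    values = df.to_numpy(dtype=dtype)   # one compact copy of the inputs
    names = list(df.columns)
    index_combos = combinations(range(len(names)), order)  # lazy iterator
    while True:
        batch = list(islice(index_combos, batch_size))
        if not batch:
            break
        block = np.empty((len(df), len(batch)), dtype=dtype)
        labels = []
        for j, idx in enumerate(batch):
            # Element-wise product of the selected columns.
            block[:, j] = values[:, list(idx)].prod(axis=1)
            labels.append('*'.join(names[i] for i in idx))
        yield pd.DataFrame(block, columns=labels, index=df.index)


# Small demo on the example frame: two batches of at most two columns each.
df = pd.DataFrame({'age': [10, 20], 'height': [5, 6], 'weight': [100, 150]})
batches = list(interaction_batches(df, order=2, batch_size=2))
for b in batches:
    print(b)
```

In real use you would score each batch (for example against the target) inside the loop and keep only the promising columns, rather than collecting every batch in a list as the demo does. Note that float32 halves the memory but loses precision for large products.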