Generate all multiplicative (product) combinations of columns in a pandas dataframe

Time:06-19

I would like to generate all 2-way (and possibly 3-way) "multiplicative" combinations (i.e., column1 x column2, column2 x column3, column1 x column3, column1 x column2 x column3, etc.) of a pandas dataframe with about 100 columns (and 500K rows). I plan to evaluate these combinations to identify high-performing 'feature interactions' in a predictive model.

Thus far, my attempts to generate the combinations (by repurposing existing Stack Overflow suggestions or other online materials) have not been successful. Here is my minimal example with a very simple dataframe and the code I am using:

df = pd.DataFrame({'age':['10','20'], 'height':['5', '6'], 'weight':['100','150']})

sample_list = df.columns.to_list()
output = list(combinations(sample_list, 2))

pd.DataFrame(df[x].prod(axis = 1) for x in output)

My code above yields "KeyError: ('age', 'height')". The final output should contain only the 2-way (and potentially 3-way) 'multiplicative' combinations (i.e., it should not include the original columns).

Can somebody please guide me to the solution? Also, how would one modify the code to generate all 3-way combinations? Any suggestions to optimize for limited RAM (~30GB) are also appreciated.

CodePudding user response:

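Firstly, the values of your dataframe should be numeric (e.g. int), not strings. Secondly, prod returns the product of values along a given axis, but df[x] with a tuple x looks up a single column labelled by that tuple, which is what raises the KeyError; you would need df[list(x)] to select several columns. IIUC you want the product of each pair of columns: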

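import pandas as pd
from itertools import combinations

df = pd.DataFrame({'age':[10,20], 'height':[5, 6], 'weight':[100,150]})

sample_list = df.columns.to_list()
output = list(combinations(sample_list, 2))

# multiply each pair of columns and store the result as a new column
for x, y in output:
    df[f'{x}*{y}'] = df[x] * df[y]

print(df)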

Output:

   age  height  weight  age*height  age*weight  height*weight
0   10       5     100          50        1000            500
1   20       6     150         120        3000            900
Edit:

This solution can be generalized to products of any length: build output with combinations(sample_list, r) for the desired r (e.g. 3) and replace the loop with:

from functools import reduce
for tup in output:
    df['*'.join(tup)] = reduce(lambda a, b: a * b, [df[i] for i in tup])
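
As a hedged, self-contained sketch of the generalization above, the same idea can collect only the interaction columns for both pairs and triples, leaving the original columns out as the question asked. It uses prod(axis=1) on a list-selected subset of columns instead of reduce; the names `products` and `interactions` are illustrative and not from the original answer.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'age': [10, 20], 'height': [5, 6], 'weight': [100, 150]})

# Illustrative names: `products` and `interactions` are not from the original answer.
products = {}
for r in (2, 3):  # 2-way and 3-way combinations
    for tup in combinations(df.columns, r):
        # df[list(tup)] selects several columns;
        # df[tup] would look up a single column labelled by the tuple
        products['*'.join(tup)] = df[list(tup)].prod(axis=1)

# Build the result in one step so the original columns are left out
interactions = pd.DataFrame(products)
print(interactions)
```

Collecting the Series in a dict and constructing the DataFrame once avoids inserting columns into df one at a time, which pandas may warn about when the number of columns gets large.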

CodePudding user response:

Consider a chained multiplication approach with reduce to build a list of Series to ultimately concat:

from functools import reduce
from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    'age':[10, 20], 'height':[5, 6], 'weight':[100,150]
})

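def chain_mul(cols):
    # name the product column after its factors, e.g. "age*height"
    col_name = "*".join(cols)
    series_dict = df[list(cols)].to_dict('series')
    # multiply the selected columns together, left to right
    col_prod = reduce(lambda x,y: x.mul(y), series_dict.values())
    return pd.Series(col_prod, name=col_name)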

# BUILD COLUMN COMBINATIONS (DUOS AND TRIOS)
sample_list = df.columns.to_list()
combns = (
    list(combinations(sample_list, 2)) +
    list(combinations(sample_list, 3))
)

# BUILD LIST OF COLUMN PRODUCTS
series_list = [chain_mul(cols) for cols in combns]

# HORIZONTAL JOIN
interactions_df = pd.concat(series_list, axis=1)

print(interactions_df)
#    age*height  age*weight  height*weight  age*height*weight
# 0          50        1000            500               5000
# 1         120        3000            900              18000
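
On the RAM question, a back-of-the-envelope count helps: 100 columns give 4,950 pairs and 161,700 triples, so at 500K rows and 8 bytes per value the pairs alone need about 19.8 GB and the triples about 647 GB. The following is a minimal sketch, not a tuned implementation. It assumes the features tolerate float32 precision and that each batch can be scored or written to disk before the next one is built. `iter_interaction_batches` and `batch_size` are hypothetical names introduced for illustration.

```python
from itertools import combinations, islice
from math import comb

import numpy as np
import pandas as pd


def iter_interaction_batches(df, r=2, batch_size=1000, dtype=np.float32):
    """Yield DataFrames holding at most `batch_size` r-way product columns each.

    Hypothetical helper for illustration; float32 halves memory versus float64
    but loses precision, which is an assumption about what the model tolerates.
    """
    values = df.to_numpy(dtype=dtype)  # one dense copy of the input columns
    names = [str(c) for c in df.columns]
    index_combos = combinations(range(len(names)), r)
    while True:
        batch = list(islice(index_combos, batch_size))
        if not batch:
            break
        # values[:, list(idx)] selects the r columns; prod(axis=1) multiplies row-wise
        data = {
            '*'.join(names[i] for i in idx): values[:, list(idx)].prod(axis=1)
            for idx in batch
        }
        yield pd.DataFrame(data, index=df.index)


df = pd.DataFrame({'age': [10, 20], 'height': [5, 6], 'weight': [100, 150]})

# Rough size estimate for the problem in the question (float64, pairs only)
n_rows, n_cols = 500_000, 100
pair_bytes = comb(n_cols, 2) * n_rows * 8

# Collected into a list only because this demo is tiny
batches = list(iter_interaction_batches(df, r=2, batch_size=2))
for batch in batches:
    print(batch)
```

In practice each batch would be consumed inside a for loop (scored, filtered, or written out) rather than collected into a list, so that only one batch of interaction columns is held in memory at a time.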