I have the following code, which takes the columns of my pandas df and computes the row-wise total of every column combination (without duplicate combinations):
import itertools as it
import pandas as pd

df = pd.DataFrame({'a': [3,4,5,6,3], 'b': [5,7,1,0,5], 'c': [3,4,2,1,3], 'd': [2,0,1,5,9]})
orig_cols = df.columns
# for every combination of 2, 3, ... n original columns, add a column holding the row-wise sum
for r in range(2, df.shape[1] + 1):
    for cols in it.combinations(orig_cols, r):
        df["_".join(cols)] = df.loc[:, cols].sum(axis=1)
df
Which generates the desired df:
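   a  b  c  d  a_b  a_c  a_d  b_c  b_d  c_d  a_b_c  a_b_d  a_c_d  b_c_d  a_b_c_d
0  3  5  3  2    8    6    5    8    7    5     11     10      8     10       13
1  4  7  4  0   11    8    4   11    7    7     15     11      8     11       15
2  5  1  2  1    6    7    6    3    2    3      8      7      8      4        9
3  6  0  1  5    6    7   11    1    5    6      7     11     12      6       12
4  3  5  3  9    8    6   12    8   14   12     11     17     15     17       20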
To take advantage of distributed computing I want to run this same code but using pyspark.pandas. I convert the df to Spark and apply the same code:
import itertools as it
import pandas as pd
import pyspark.pandas as ps

df = pd.DataFrame({'a': [3,4,5,6,3], 'b': [5,7,1,0,5], 'c': [3,4,2,1,3], 'd': [2,0,1,5,9]})
dfs = ps.from_pandas(df)  # convert from pandas to pandas-on-Spark
orig_cols = dfs.columns
for r in range(2, dfs.shape[1] + 1):
    for cols in it.combinations(orig_cols, r):
        dfs["_".join(cols)] = dfs.loc[:, cols].sum(axis=1)
dfs
but I am getting an error message:
IndexError: tuple index out of range
Why is the code not working? What change do I need to make so it can work in pyspark?
CodePudding user response:
The error is fixed by converting the tuple to a list. The tuple that it.combinations yields appears to be interpreted by pyspark.pandas .loc as a multi-level column key rather than a list of labels, which is what triggers the IndexError. Note the compute.ops_on_diff_frames option as well: pyspark.pandas refuses to combine or assign Series it considers to come from different DataFrames unless it is enabled. Try this:
import itertools as it
import pandas as pd
import pyspark.pandas as ps

# allow assigning Series that pyspark.pandas considers to come from a different frame
ps.set_option('compute.ops_on_diff_frames', True)

df = pd.DataFrame({'a': [3,4,5,6,3], 'b': [5,7,1,0,5], 'c': [3,4,2,1,3], 'd': [2,0,1,5,9]})
dfs = ps.from_pandas(df)  # convert from pandas to pandas-on-Spark
orig_cols = dfs.columns
for r in range(2, dfs.shape[1] + 1):
    for cols in it.combinations(orig_cols, r):
        # pass a list, not a tuple, so .loc selects multiple columns
        dfs["_".join(list(cols))] = dfs.loc[:, list(cols)].sum(axis=1)
dfs
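If you would rather stay in native PySpark, here is a minimal sketch of an alternative (assuming a plain Spark DataFrame instead of the pandas API; spark, sdf and new_cols are names introduced here for illustration). It builds every combination column in a single select, which avoids the ops_on_diff_frames option entirely:

import itertools as it
from functools import reduce
from operator import add
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)  # df is the pandas frame from above

# one Column expression per combination, e.g. (col('a') + col('b')).alias('a_b')
new_cols = [
    reduce(add, [F.col(c) for c in cols]).alias("_".join(cols))
    for r in range(2, len(sdf.columns) + 1)
    for cols in it.combinations(sdf.columns, r)
]
sdf = sdf.select("*", *new_cols)
sdf.show()

Building all the columns in one select keeps everything in a single Spark plan instead of mutating the frame once per combination.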