Home > other >  Cartesian product for values in a single row pandas df
Cartesian product for values in a single row pandas df

Time:05-11

I have a df with 14 columns and 20,000 rows. I would like to create a two column dataframe that represents each unique pairing for the data entries within each single row. Example:

#sample df:
​data = {'first':  ['red', 'blue', 'yellow'],
        'second': ['blue', 'pink', 'orange'],
         'third': ['green', 'grey', None]}
df = pd.DataFrame(data)

df

    first   second  third
0   red     blue    green
1   blue    pink    grey
2   yellow  orange  None

for this df input my desired output would be:


    pairA   pairB
0   red     blue
1   red     green
2   blue    green
3   blue    pink
4   blue    grey
5   pink    grey
6   yellow  orange
7   yellow  None
8   orange  None

I have tried to use itertools product, but have only made that work column-wise for two columns. I believe a for loop would take way too long with the size of the data. Is there a pandas way to do this?

CodePudding user response:

Try this:

from itertools import combinations

combo = df.apply(lambda row: list(combinations(row, 2)), axis=1).explode().to_list()
pd.DataFrame(combo, columns=["pairA", "pairB"])

CodePudding user response:

You have many ways,

if order does not matter you can use:

from itertools import combinations
pd.DataFrame([c for x in df.agg(set, axis=1) for c in combinations(x, r=2)],
             columns=['pairA', 'pairB'])

If order is important:

from itertools import combinations
pd.DataFrame([c for x in df.apply(lambda r: tuple(pd.Series.unique(r)), axis=1)
              for c in combinations(x, r=2)],
             columns=['pairA', 'pairB'])

output:

    pairA   pairB
0     red    blue
1     red   green
2    blue   green
3    blue    pink
4    blue    grey
5    pink    grey
6  yellow  orange
7  yellow    None
8  orange    None
  • Related