Outer product on Pandas DataFrame rows-CodePudding

Have two DataFrames with identical columns labels. Columns are label, data1, data2, ..., dataN. Need to take the product of the DataFrames, multiplying data1 * data1, data2 * data2, etc for every possible combination of rows in DataFrame1 with the rows in DataFrame2. As such, want the resulting DataFrame to maintain the label column of both frames in some way

Example:

Frame 1:

label	d1	d2	d3
a	1	2	3
b	4	5	6

Frame 2:

label	d1	d2	d3
c	7	8	9
d	10	11	12

Result:

label_1	label_2	d1	d2	d3
a	c	7	16	27
a	d	10	22	36
b	c	28	40	54
b	d	40	55	72

I feel like there is a nice way to do this, but all I can come up with is gross loops with lots of memory reallocation.

CodePudding user response：

Let's do a cross merge first then mutiple the dn_x with dn_y

out = df1.merge(df2, how='cross')
out = (out.filter(like='label')
       .join(out.filter(regex='d.*_x')
             .mul(out.filter(regex='d.*_y').values)
             .rename(columns=lambda col: col.split('_')[0])))

print(out)

  label_x label_y  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72

CodePudding user response：

first idea with DataFrame.reindex and MultiIndex created by MultiIndex.from_product:

mux = pd.MultiIndex.from_product([df1['label'], df2['label']])

df = (df1.set_index('label').reindex(mux, level=0)
         .mul(df2.set_index('label').reindex(mux, level=1))
         .rename_axis(['label1','label2'])
         .reset_index())
print (df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72

Or solution with cross join:

df = (df1.rename(columns={'label':'label1'})
          .merge(df2.rename(columns={'label':'label2'}), 
                 how='cross',
                 suffixes=('_','')))

For multiple columns get cols ends by _ and multiple same columns without _, last drop columns cols:

cols = df.filter(regex='_$').columns
no_ = cols.str.rstrip('_')
df[no_] *= df[cols].to_numpy()
df = df.drop(cols, axis=1)
print (df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72