Home > Net >  Outer product on Pandas DataFrame rows
Outer product on Pandas DataFrame rows

Time:10-26

Have two DataFrames with identical columns labels. Columns are label, data1, data2, ..., dataN. Need to take the product of the DataFrames, multiplying data1 * data1, data2 * data2, etc for every possible combination of rows in DataFrame1 with the rows in DataFrame2. As such, want the resulting DataFrame to maintain the label column of both frames in some way

Example:

Frame 1:

label d1 d2 d3
a 1 2 3
b 4 5 6

Frame 2:

label d1 d2 d3
c 7 8 9
d 10 11 12

Result:

label_1 label_2 d1 d2 d3
a c 7 16 27
a d 10 22 36
b c 28 40 54
b d 40 55 72

I feel like there is a nice way to do this, but all I can come up with is gross loops with lots of memory reallocation.

CodePudding user response:

Let's do a cross merge first then mutiple the dn_x with dn_y

out = df1.merge(df2, how='cross')
out = (out.filter(like='label')
       .join(out.filter(regex='d.*_x')
             .mul(out.filter(regex='d.*_y').values)
             .rename(columns=lambda col: col.split('_')[0])))
print(out)

  label_x label_y  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72

CodePudding user response:

first idea with DataFrame.reindex and MultiIndex created by MultiIndex.from_product:

mux = pd.MultiIndex.from_product([df1['label'], df2['label']])

df = (df1.set_index('label').reindex(mux, level=0)
         .mul(df2.set_index('label').reindex(mux, level=1))
         .rename_axis(['label1','label2'])
         .reset_index())
print (df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72

Or solution with cross join:

df = (df1.rename(columns={'label':'label1'})
          .merge(df2.rename(columns={'label':'label2'}), 
                 how='cross',
                 suffixes=('_','')))

For multiple columns get cols ends by _ and multiple same columns without _, last drop columns cols:

cols = df.filter(regex='_$').columns
no_ = cols.str.rstrip('_')
df[no_] *= df[cols].to_numpy()
df = df.drop(cols, axis=1)
print (df)
  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72
  • Related