A good method for comparing each row in dataframe to each row in another dataframe?

I'm currently looking for a better way to do some calculations that I handle in a rather tedious fashion, and I'd like to change that because some of the dataframes to be processed can be quite large.

I would like to compare each row in one dataframe with each row in another dataframe. By nature this scales quadratically (n · m row pairs), but I was wondering whether there is a good way of doing this?

df1 = pd.DataFrame([
    {"value": 1, "timestamp": "1970-01-01T00:00:00z"},
    {"value": 2, "timestamp": "1970-01-01T00:00:00z"},
    ...
])

df2 = pd.DataFrame([
    {"value": 3, "timestamp": "1970-01-01T00:00:03z"},
    {"value": 4, "timestamp": "1970-01-01T00:00:04z"},
    ...
])

With these two dataframes, I would like to be able to make calculations for the pairs 1-3, 1-4, 2-3 and 2-4.

The calculations that need to take place involve comparing the columns for each 'row pair'. My first method of doing this was simply to use the itertools function product, i.e.

from itertools import product

for (_, x), (_, y) in product(df1.iterrows(), df2.iterrows()):
    func(x, y)

This of course doesn't scale nicely. In desperation I've come up with a new 'solution' that I can't imagine is any more scalable:

df1.apply(lambda x: df2.apply(lambda y: func(x, y), axis=1), axis=1)

This also doesn't scale well; in fact, in the tests that I've done, it's actually slower.

This is the main idea of what I'd like to do, though certainly not the optimal way of doing it:

df1 = pd.DataFrame([
    {"value": 1, "timestamp": "1970-01-01T00:00:01z"},
    {"value": 2, "timestamp": "1970-01-01T00:00:02z"},
    ...
])

df2 = pd.DataFrame([
    {"value": 3, "timestamp": "1970-01-01T00:00:03z"},
    {"value": 4, "timestamp": "1970-01-01T00:00:04z"},
    ...
])

def func(x, y, lst):
    lst.append(x["value"] + y["value"])

results = []

df1.apply(lambda x: df2.apply(lambda y: func(x, y, results), axis=1), axis=1)

print(results) # [4, 5, 5, 6, ...]

CodePudding user response:

NumPy's ufunc.outer can be used with any relevant ufunc, such as numpy.add, numpy.multiply, etc.

import numpy as np

np.add.outer(df1['value'].values, df2['value'].values).flatten()

array([4, 5, 5, 6], dtype=int64)
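
If the operation you need isn't available as a NumPy ufunc, plain broadcasting over the value columns gives the same all-pairs layout. A minimal sketch, assuming the df1/df2 from the question and simple addition as the pairwise operation:

import numpy as np

a = df1["value"].to_numpy()        # shape (n,)
b = df2["value"].to_numpy()        # shape (m,)

# A column vector broadcast against a row vector evaluates every pair (i, j)
pairs = a[:, None] + b[None, :]    # shape (n, m), equivalent to np.add.outer(a, b)
print(pairs.ravel())               # [4 5 5 6]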

CodePudding user response:

Use merge with how='cross' to get the Cartesian product:

df1.merge(df2, how='cross')

output:

   value_x  value_y
0        1        3
1        1        4
2        2        3
3        2        4

then apply your func:

df1.merge(df2, how='cross').apply(lambda x: func(x['value_x'], x['value_y']), axis=1)
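
Note that .apply here still calls func once per row pair in Python. If func can be expressed as column arithmetic, it is usually much faster to operate on the whole columns of the cross-merged frame at once; a sketch under that assumption, using the addition from the question:

cross = df1.merge(df2, how='cross')

# Vectorised over the full columns instead of one row pair at a time
results = cross['value_x'] + cross['value_y']
print(results.tolist())   # [4, 5, 5, 6]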



example

func: f(x, y) = x * y

df1.merge(df2, how='cross').apply(lambda x: x['value_x'] * x['value_y'], axis=1)

output:

0    3
1    4
2    6
3    8
dtype: int64
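
The same multiplication can also be done without apply by multiplying the columns directly; a short sketch, assuming the same cross-merged frame:

cross = df1.merge(df2, how='cross')
print((cross['value_x'] * cross['value_y']).tolist())   # [3, 4, 6, 8]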