I'm currently in need for a better method for doing some calculations in a quite tedious fashion. And I'd like to change that, due to some of the dataframes that will be processed can have a large size.
I would like to do some comparisons for each row in one dataframe to each row in another dataframe. This by nature scales at x^2, but I was wondering if there were any good way of doing this?
df1 = pd.DataFrame([
{"value": 1, "timestamp": "1970-01-01T00:00:00z"},
{"value": 2, "timestamp": "1970-01-01T00:00:00z"},
...
])
df2 = pd.DataFrame([
{"value": 3, "timestamp": "1970-01-01T00:00:03z"},
{"value": 4, "timestamp": "1970-01-01T00:00:04z"},
...
])
With these two dataframes, I would like to be able to make calculations for 1-3, 1-4, 2-3 and 2-4
The calculations that needs to take place, involve comparing the columns for each 'row-pair'.My first method of doing this was just with the use of itertools' module function product. I.E.
for (_,x), (_,y) in product(X, Y):
func(x, y)
This of course doesn't scale so nicely. In desperation I've come up with a new 'solution' that I can't imagine is more scalable
df1.apply(lambda x: df2.apply(lambda y: func(x, y), axis=1), axis=1)
This also doesn't scale well, in fact in the tests that I've done, it's actually slower.
This the main idea of what I'd like to do, and certainly not the optimal way of doing it..
df1 = pd.DataFrame([
{"value": 1, "timestamp": "1970-01-01T00:00:01z"},
{"value": 2, "timestamp": "1970-01-01T00:00:02z"},
...
])
df2 = pd.DataFrame([
{"value": 3, "timestamp": "1970-01-01T00:00:03z"},
{"value": 4, "timestamp": "1970-01-01T00:00:04z"},
...
])
def func(x, y, lst):
lst.append(x["value"] y["value"])
results = []
df1.apply(lambda x: df2.apply(lambda y: func(x, y, results), axis=1), axis=1)
print(results) # [4, 5, 5, 6, ...]
CodePudding user response:
Numpy ufunc.outer can be utilized to relevant ufunc function such as numpy.add, numpy.multiply, etc.
import numpy as np
np.add.outer(df1.values, df2.values).flatten()
array([4, 5, 5, 6], dtype=int64)
CodePudding user response:
merge
cross
get cartesian product:
df1.merge(df2, how='cross')
output:
value_x value_y
0 1 3
1 1 4
2 2 3
3 2 4
then apply
your func
:
df1.merge(df2, how='cross').apply(lambda x: func(x['value_x'], x['value_y']), axis=1)
example
func
: f(x, y) = xy
df1.merge(df2, how='cross').apply(lambda x: x['value_x'] * x['value_y'], axis=1)
output:
0 3
1 4
2 6
3 8
dtype: int64