I have two dataframes. Here are samples from each. dt1:
id val
1 smth11
1 smth12
2 smth21
2 smth22
2 smth23
... ...
dt2:
id val
1 blabla
2 bla2
2 bla3
... ...
I have a function, my_func, which calculates a similarity score from 0 to 1 between two strings (like "smth11" and "blabla" in this example). For each value in the "val" column of dt1, I want to count the number of values in the "val" column of dt2 that have a score greater than 0.7. Only values that belong to the same "id" group in both datasets are compared. So the desired result should look like this:
id val count
1 smth11 2
1 smth12 2
2 smth21 5
2 smth22 7
2 smth23 3
... ...
The problem is that my actual datasets are huge (several thousand rows each). I wanted to know how I could do this in the most efficient way (perhaps by doing the calculations in parallel?).
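To make the task concrete, a naive version of this computation would look something like the sketch below (my_func is the similarity function mentioned above; count_matches is just an illustrative name):

import pandas as pd

# Naive illustration of the computation described above;
# my_func(a, b) is the similarity function returning a score between 0 and 1.
def count_matches(dt1, dt2, my_func, threshold=0.7):
    counts = []
    for _, row in dt1.iterrows():
        # Compare only against dt2 values in the same "id" group.
        candidates = dt2.loc[dt2['id'] == row['id'], 'val']
        counts.append(sum(my_func(row['val'], v) > threshold for v in candidates))
    result = dt1.copy()
    result['count'] = counts
    return result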
CodePudding user response:
I think that the following code should be pretty fast, since the merge and the grouped counting are handled by pandas/NumPy; only the per-row similarity call itself runs in Python.
import pandas as pd
import numpy as np
import random
# Since the similarity function was not given,
# we'll use random.random to generate values
# between 0 and 1
random.seed(1)
a1 = np.array([
    [1, 'smth11'],
    [1, 'smth12'],
    [2, 'smth21'],
    [2, 'smth23'],
    [2, 'smth24'],
])
df1 = pd.DataFrame(a1, columns=['id', 'val1'])

a2 = np.array([
    [1, 'blabla'],
    [2, 'bla2'],
    [2, 'bla3'],
])
df2 = pd.DataFrame(a2, columns=['id', 'val2'])
# matrix merges the df's in such a way as to include
# all (useful) combinations of df1 and df2
matrix = df1.merge(df2, left_on='id', right_on='id')
# Here we add the 'similarity' column to the matrix df.
# You will need to modify the (similarity) lambda function below.
# I.e. something like lambda row: <some fn of row['val1'] and row['val2']>
matrix['similarity'] = matrix.apply(lambda row: random.random(), axis=1)
print('------ matrix with scores')
print(matrix)
# Finally we count cases with similarities > .7
counts = matrix.query("similarity > .7").groupby("val1").size()
print('------ counts')
print(counts)
print('NOTE: the type of "counts" is', type(counts))
Output:
------ matrix with scores
id val1 val2 similarity
0 1 smth11 blabla 0.134364
1 1 smth12 blabla 0.847434
2 2 smth21 bla2 0.763775
3 2 smth21 bla3 0.255069
4 2 smth23 bla2 0.495435
5 2 smth23 bla3 0.449491
6 2 smth24 bla2 0.651593
7 2 smth24 bla3 0.788723
------ counts
val1
smth12 1
smth21 1
smth24 1
dtype: int64
NOTE: the type of "counts" is <class 'pandas.core.series.Series'>
Please let us know how this code performs with your data.
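Note that counts only includes val1 values that had at least one match above the threshold. If you want the exact id/val/count layout from your question (including zeros), and a real similarity function plugged in, a sketch along these lines should work (difflib.SequenceMatcher is only a stand-in for your my_func):

import difflib

# Stand-in for your my_func; swap in your real similarity function here.
def my_func(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

matrix['similarity'] = matrix.apply(
    lambda row: my_func(row['val1'], row['val2']), axis=1)
counts = matrix.query("similarity > .7").groupby("val1").size()

# Map the counts back onto df1, filling 0 for values with no matches,
# to reproduce the id/val/count table from the question.
result = df1.rename(columns={'val1': 'val'})
result['count'] = df1['val1'].map(counts).fillna(0).astype(int)
print(result)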
CodePudding user response:
This is my second answer to this question. It uses the exact same method as the previous answer but adds:
- Programmatically generating arbitrarily large datasets.
- Performance measurement.
As it turns out, with datasets df1 and df2 having about 4000 rows each, the computation of "counts" takes about 0.68 seconds. This is on an Intel i5-4570 Quad Core 3.2GHz CPU.
The code, as it appears below, uses Python's (very fast) random.random() function to simulate the similarity calculation. Switching to the slower random.betavariate(alpha, beta) function increases the runtime from about 0.7 seconds to about 1 second. See the comments marked "NOTE" to experiment with this.
Code:
import pandas as pd
import numpy as np
import random
import time
def get_df1(group_count, min_size, max_size, population):
    rows = []
    for group in range(1, group_count + 1):
        row_count = random.randint(min_size, max_size)
        ids = sorted(random.sample(range(1, population + 1), row_count))
        for i in range(row_count):
            rows.append([group, f'smth-{group}-{ids[i]}'])
    return pd.DataFrame(rows, columns=['id', 'val1'])

def get_df2(group_count, min_size, max_size):
    rows = []
    for group in range(1, group_count + 1):
        row_count = random.randint(min_size, max_size)
        for i in range(row_count):
            rows.append([group, 'blablabla'])
    return pd.DataFrame(rows, columns=['id', 'val2'])

def simulate(group_count, min_size, max_size, population, sensitivity):
    df1 = get_df1(group_count, min_size, max_size, population)
    df2 = get_df2(group_count, min_size, max_size)
    # Measure time from here...
    start_time = time.time()
    matrix = df1.merge(df2, left_on='id', right_on='id')
    matrix['similarity'] = matrix.apply(
        # NOTE: Using random.random() takes 0.680 seconds
        lambda row: random.random(), axis=1)
        # NOTE: Using random.betavariate(1, 5) takes 1.050 seconds
        #lambda row: random.betavariate(1, 5), axis=1)
    counts = matrix.query(f'similarity > {sensitivity}').groupby("val1").size()
    seconds = time.time() - start_time
    # ... to here
    print('-' * 40, 'df1\n', df1)
    print('-' * 40, 'df2\n', df2)
    print('-' * 40, 'matrix\n', matrix)
    print('-' * 40, 'counts\n', counts)
    print('-------------------------- Summary')
    print('df1 rows: ', len(df1))
    print('df2 rows: ', len(df2))
    print('matrix rows:', len(matrix))
    print('counts rows:', len(counts))
    print(f'--- {seconds:.3f} seconds ---')

random.seed(2)
simulate(100, 30, 50, 500, .7)
Output:
---------------------------------------- df1
id val1
0 1 smth-1-19
1 1 smth-1-44
2 1 smth-1-47
3 1 smth-1-82
4 1 smth-1-87
... ... ...
3917 100 smth-100-449
3918 100 smth-100-465
3919 100 smth-100-478
3920 100 smth-100-496
3921 100 smth-100-500
[3922 rows x 2 columns]
---------------------------------------- df2
id val2
0 1 blablabla
1 1 blablabla
2 1 blablabla
3 1 blablabla
4 1 blablabla
... ... ...
3903 100 blablabla
3904 100 blablabla
3905 100 blablabla
3906 100 blablabla
3907 100 blablabla
[3908 rows x 2 columns]
---------------------------------------- matrix
id val1 val2 similarity
0 1 smth-1-19 blablabla 0.150723
1 1 smth-1-19 blablabla 0.333073
2 1 smth-1-19 blablabla 0.592977
3 1 smth-1-19 blablabla 0.917483
4 1 smth-1-19 blablabla 0.119862
... ... ... ... ...
153482 100 smth-100-500 blablabla 0.645689
153483 100 smth-100-500 blablabla 0.595884
153484 100 smth-100-500 blablabla 0.697562
153485 100 smth-100-500 blablabla 0.704013
153486 100 smth-100-500 blablabla 0.342706
[153487 rows x 4 columns]
---------------------------------------- counts
val1
smth-1-109 13
smth-1-129 11
smth-1-138 10
smth-1-158 11
smth-1-185 5
..
smth-99-49 9
smth-99-492 8
smth-99-59 10
smth-99-95 11
smth-99-97 13
Length: 3922, dtype: int64
-------------------------- Summary
df1 rows: 3922
df2 rows: 3908
matrix rows: 153487
counts rows: 3922
--- 0.673 seconds ---
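Since the question also asks about parallelism: if the real similarity function is much slower than random.random(), the row-wise apply will dominate the runtime, and that step can be split across processes. The sketch below is untested against your data; the similarity() stand-in and the chunk count are assumptions you would adapt. It scores chunks of the merged matrix in a multiprocessing pool:

import random
import numpy as np
import pandas as pd
from multiprocessing import Pool

def similarity(a, b):
    # Stand-in for the real my_func; replace with the actual scoring function.
    return random.random()

def score_chunk(chunk):
    # Score one chunk of the merged matrix row by row.
    return chunk.apply(lambda row: similarity(row['val1'], row['val2']), axis=1)

if __name__ == '__main__':
    df1 = pd.DataFrame({'id': [1, 1, 2], 'val1': ['smth11', 'smth12', 'smth21']})
    df2 = pd.DataFrame({'id': [1, 2, 2], 'val2': ['blabla', 'bla2', 'bla3']})
    matrix = df1.merge(df2, on='id')

    n_chunks = 2  # assumption: set this to roughly your CPU core count
    positions = np.array_split(np.arange(len(matrix)), n_chunks)
    chunks = [matrix.iloc[pos] for pos in positions]

    with Pool(n_chunks) as pool:
        # Each chunk keeps its original index, so concat re-aligns the scores.
        matrix['similarity'] = pd.concat(pool.map(score_chunk, chunks))

    counts = matrix.query('similarity > .7').groupby('val1').size()
    print(counts)

Whether this actually beats the single-process apply depends on how expensive the similarity function is; for very cheap scoring functions the pickling overhead can eat the gain.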