I have two dataframes. Here are samples from each. dt1:
id val
1 smth11
1 smth12
2 smth21
2 smth22
2 smth23
... ...
dt2:
id val
1 blabla
2 bla2
2 bla3
... ...
I have a function, my_func, which calculates a similarity score from 0 to 1 between two strings (like "smth11" and "blabla" in this example). For each value in the "val" column of dt1, I want to count the number of values in the "val" column of dt2 that have a score greater than 0.7. Only values that belong to the same "id" group in both datasets are compared. So the desired result should look like this:
id val count
1 smth11 2
1 smth12 2
2 smth21 5
2 smth22 7
2 smth23 3
... ...
The problem is that my actual datasets are huge (several thousand rows each). I wanted to know how I could do this in the most efficient way (perhaps by doing the calculations in parallel?).
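To make the task concrete, a naive version of this computation would look something like the sketch below (my_func is the similarity function mentioned above; count_matches is just an illustrative name):

import pandas as pd

# Naive illustration of the computation described above;
# my_func(a, b) is the similarity function returning a score between 0 and 1.
def count_matches(dt1, dt2, my_func, threshold=0.7):
    counts = []
    for _, row in dt1.iterrows():
        # Compare only against dt2 values in the same "id" group.
        candidates = dt2.loc[dt2['id'] == row['id'], 'val']
        counts.append(sum(my_func(row['val'], v) > threshold for v in candidates))
    result = dt1.copy()
    result['count'] = counts
    return result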
CodePudding user response:
I think that the following code should be pretty fast, since the merge and the grouped counting are handled by pandas/NumPy; only the per-row similarity call itself runs in Python.
import pandas as pd
import numpy as np
import random
# Since the similarity function was not given,
# we'll use random.random to generate values
# between 0 and 1
random.seed(1)
a1 = np.array([
    [1, 'smth11'],
    [1, 'smth12'],
    [2, 'smth21'],
    [2, 'smth23'],
    [2, 'smth24'],
])
df1 = pd.DataFrame(a1, columns=['id', 'val1'])

a2 = np.array([
    [1, 'blabla'],
    [2, 'bla2'],
    [2, 'bla3'],
])
df2 = pd.DataFrame(a2, columns=['id', 'val2'])
# matrix merges the df's in such a way as to include
# all (useful) combinations of df1 and df2
matrix = df1.merge(df2, left_on='id', right_on='id')
# Here we add the 'similarity' column to the matrix df.
# You will need to modify the (similarity) lambda function below.
# I.e. something like lambda row: <some fn of row['val1'] and row['val2']>
matrix['similarity'] = matrix.apply(lambda row: random.random(), axis=1)
print('------ matrix with scores')
print(matrix)
# Finally we count cases with similarities > .7
counts = matrix.query("similarity > .7").groupby("val1").size()
print('------ counts')
print(counts)
print('NOTE: the type of "counts" is', type(counts))
Output:
------ matrix with scores
id val1 val2 similarity
0 1 smth11 blabla 0.134364
1 1 smth12 blabla 0.847434
2 2 smth21 bla2 0.763775
3 2 smth21 bla3 0.255069
4 2 smth23 bla2 0.495435
5 2 smth23 bla3 0.449491
6 2 smth24 bla2 0.651593
7 2 smth24 bla3 0.788723
------ counts
val1
smth12 1
smth21 1
smth24 1
dtype: int64
NOTE: the type of "counts" is <class 'pandas.core.series.Series'>
Please let us know how this code performs with your data.
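Note that counts only includes val1 values that had at least one match above the threshold. If you want the exact id/val/count layout from your question (including zeros), and a real similarity function plugged in, a sketch along these lines should work (difflib.SequenceMatcher is only a stand-in for your my_func):

import difflib

# Stand-in for your my_func; swap in your real similarity function here.
def my_func(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).ratio()

matrix['similarity'] = matrix.apply(
    lambda row: my_func(row['val1'], row['val2']), axis=1)
counts = matrix.query("similarity > .7").groupby("val1").size()

# Map the counts back onto df1, filling 0 for values with no matches,
# to reproduce the id/val/count table from the question.
result = df1.rename(columns={'val1': 'val'})
result['count'] = df1['val1'].map(counts).fillna(0).astype(int)
print(result)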
CodePudding user response:
This is my second answer to this question. It uses the exact same method as the previous answer but adds:
- Programmatically generating arbitrarily large datasets.
- Performance measurement.
As it turns out, with datasets df1 and df2 having about 4000 rows each, the computation of "counts" takes about 0.68 seconds. This is on an Intel i5-4570 Quad Core 3.2GHz CPU.
The code, as it appears below, uses Python's (very fast) random.random() function to simulate the similarity calculation. Switching to the slower random.betavariate(alpha, beta) function increases the runtime from about 0.7 seconds to about 1 second. See the comments marked "NOTE" to experiment with this.
Code:
import pandas as pd
import numpy as np
import random
import time
def get_df1(group_count, min_size, max_size, population):
    rows = []
    for group in range(1, group_count + 1):
        row_count = random.randint(min_size, max_size)
        ids = sorted(random.sample(range(1, population + 1), row_count))
        for i in range(row_count):
            rows.append([group, f'smth-{group}-{ids[i]}'])
    return pd.DataFrame(rows, columns=['id', 'val1'])

def get_df2(group_count, min_size, max_size):
    rows = []
    for group in range(1, group_count + 1):
        row_count = random.randint(min_size, max_size)
        for i in range(row_count):
            rows.append([group, 'blablabla'])
    return pd.DataFrame(rows, columns=['id', 'val2'])

def simulate(group_count, min_size, max_size, population, sensitivity):
    df1 = get_df1(group_count, min_size, max_size, population)
    df2 = get_df2(group_count, min_size, max_size)
    # Measure time from here...
    start_time = time.time()
    matrix = df1.merge(df2, left_on='id', right_on='id')
    matrix['similarity'] = matrix.apply(
        # NOTE: Using random.random() takes 0.680 seconds
        lambda row: random.random(), axis=1)
        # NOTE: Using random.betavariate(1, 5) takes 1.050 seconds
        #lambda row: random.betavariate(1, 5), axis=1)
    counts = matrix.query(f'similarity > {sensitivity}').groupby("val1").size()
    seconds = time.time() - start_time
    # ... to here
    print('-' * 40, 'df1\n', df1)
    print('-' * 40, 'df2\n', df2)
    print('-' * 40, 'matrix\n', matrix)
    print('-' * 40, 'counts\n', counts)
    print('-------------------------- Summary')
    print('df1 rows: ', len(df1))
    print('df2 rows: ', len(df2))
    print('matrix rows:', len(matrix))
    print('counts rows:', len(counts))
    print(f'--- {seconds:.3f} seconds ---')

random.seed(2)
simulate(100, 30, 50, 500, .7)
Output:
---------------------------------------- df1
id val1
0 1 smth-1-19
1 1 smth-1-44
2 1 smth-1-47
3 1 smth-1-82
4 1 smth-1-87
... ... ...
3917 100 smth-100-449
3918 100 smth-100-465
3919 100 smth-100-478
3920 100 smth-100-496
3921 100 smth-100-500
[3922 rows x 2 columns]
---------------------------------------- df2
id val2
0 1 blablabla
1 1 blablabla
2 1 blablabla
3 1 blablabla
4 1 blablabla
... ... ...
3903 100 blablabla
3904 100 blablabla
3905 100 blablabla
3906 100 blablabla
3907 100 blablabla
[3908 rows x 2 columns]
---------------------------------------- matrix
id val1 val2 similarity
0 1 smth-1-19 blablabla 0.150723
1 1 smth-1-19 blablabla 0.333073
2 1 smth-1-19 blablabla 0.592977
3 1 smth-1-19 blablabla 0.917483
4 1 smth-1-19 blablabla 0.119862
... ... ... ... ...
153482 100 smth-100-500 blablabla 0.645689
153483 100 smth-100-500 blablabla 0.595884
153484 100 smth-100-500 blablabla 0.697562
153485 100 smth-100-500 blablabla 0.704013
153486 100 smth-100-500 blablabla 0.342706
[153487 rows x 4 columns]
---------------------------------------- counts
val1
smth-1-109 13
smth-1-129 11
smth-1-138 10
smth-1-158 11
smth-1-185 5
..
smth-99-49 9
smth-99-492 8
smth-99-59 10
smth-99-95 11
smth-99-97 13
Length: 3922, dtype: int64
-------------------------- Summary
df1 rows: 3922
df2 rows: 3908
matrix rows: 153487
counts rows: 3922
--- 0.673 seconds ---
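Since the question also asks about parallelism: if the real similarity function is much slower than random.random(), the row-wise apply will dominate the runtime, and that step can be split across processes. The sketch below is untested against your data; the similarity() stand-in and the chunk count are assumptions you would adapt. It scores chunks of the merged matrix in a multiprocessing pool:

import random
import numpy as np
import pandas as pd
from multiprocessing import Pool

def similarity(a, b):
    # Stand-in for the real my_func; replace with the actual scoring function.
    return random.random()

def score_chunk(chunk):
    # Score one chunk of the merged matrix row by row.
    return chunk.apply(lambda row: similarity(row['val1'], row['val2']), axis=1)

if __name__ == '__main__':
    df1 = pd.DataFrame({'id': [1, 1, 2], 'val1': ['smth11', 'smth12', 'smth21']})
    df2 = pd.DataFrame({'id': [1, 2, 2], 'val2': ['blabla', 'bla2', 'bla3']})
    matrix = df1.merge(df2, on='id')

    n_chunks = 2  # assumption: set this to roughly your CPU core count
    positions = np.array_split(np.arange(len(matrix)), n_chunks)
    chunks = [matrix.iloc[pos] for pos in positions]

    with Pool(n_chunks) as pool:
        # Each chunk keeps its original index, so concat re-aligns the scores.
        matrix['similarity'] = pd.concat(pool.map(score_chunk, chunks))

    counts = matrix.query('similarity > .7').groupby('val1').size()
    print(counts)

Whether this actually beats the single-process apply depends on how expensive the similarity function is; for very cheap scoring functions the pickling overhead can eat the gain.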