I have a pandas DataFrame "d1" with "n" columns and "m" rows, and another DataFrame "d2" with a single row. My task is to find the row in d1 that best matches d2, i.e. the row with the highest count of matching column values.
d2:
a b c d e
0 machine machine learning spark java python
The d1 dataframe looks similar, just with more rows:
a b c d e
0 machine deep learning bigdata java python
1 cnn nlp bigdata spark java
When I compare d1 with d2, the best-matching row is index '0', as it has the most values in common with d2.
As of now, I iterate over the d1 dataframe and compare each row with d2, which is taking a lot of time. Below is the code:
max_score = 0
best_index = None
for index, row in d1.iterrows():
    # count of values this row shares with d2's single row
    score = len(set(row.values.tolist()) & set(d2.values[0].tolist()))
    if score >= cutoff_score and score > max_score:
        max_score = score
        best_index = index
Is there an optimal way to do this comparison, ideally in O(1) or O(log n)?
CodePudding user response:
iterrows is a trap. It's basically never the best option, and it needlessly overcomplicates things by deconstructing a perfectly good DataFrame.
d2_set = set(df2.loc[0])
out = df1.apply(lambda x: len(set(x) & d2_set), axis=1, raw=True)
print(out)
print(out.idxmax())
Output:
0 3
1 2
dtype: int64
0
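For reference, here is the same approach end to end; the data below is reconstructed from the question (assuming "machine learning" and "deep learning" are single cell values, which is what makes the match counts come out as 3 and 2):

```python
import pandas as pd

# Data reconstructed from the question ("machine learning" and
# "deep learning" are assumed to be single cell values)
df2 = pd.DataFrame([["machine", "machine learning", "spark", "java", "python"]],
                   columns=list("abcde"))
df1 = pd.DataFrame([["machine", "deep learning", "bigdata", "java", "python"],
                    ["cnn", "nlp", "bigdata", "spark", "java"]],
                   columns=list("abcde"))

d2_set = set(df2.loc[0])
# raw=True passes each row as a plain ndarray instead of a Series,
# which skips per-row Series construction and speeds up apply
out = df1.apply(lambda x: len(set(x) & d2_set), axis=1, raw=True)
print(out)           # per-row overlap counts: 0 -> 3, 1 -> 2
print(out.idxmax())  # index of the best-matching row: 0
```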
Given 2,000,000 rows:
a b c d e
0 machine deep learning bigdata java python
1 cnn nlp bigdata spark java
2 machine deep learning bigdata java python
3 cnn nlp bigdata spark java
4 machine deep learning bigdata java python
... ... ... ... ... ...
1999995 cnn nlp bigdata spark java
1999996 machine deep learning bigdata java python
1999997 cnn nlp bigdata spark java
1999998 machine deep learning bigdata java python
1999999 cnn nlp bigdata spark java
[2000000 rows x 5 columns]
It only takes about 4 seconds, and max/idxmax then only takes a couple of milliseconds:
>>> t = time();out = df1.apply(lambda x: len(set(x) & d2_set), axis=1, raw=True);time()-t;
4.035001277923584
>>> t = time();out.max();out.idxmax();time()-t;
0.0015015602111816406
For comparison, your iterrows approach took about 60 seconds on the same data.
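If even the apply overhead matters, a fully vectorized sketch with np.isin is also possible. One caveat: it counts matching cells rather than unique matching values, which coincides here but can differ if a row repeats a matching value. The data below is the same reconstruction assumed from the question:

```python
import numpy as np
import pandas as pd

# Same reconstructed data as in the question (assumed cell grouping)
df2 = pd.DataFrame([["machine", "machine learning", "spark", "java", "python"]],
                   columns=list("abcde"))
df1 = pd.DataFrame([["machine", "deep learning", "bigdata", "java", "python"],
                    ["cnn", "nlp", "bigdata", "spark", "java"]],
                   columns=list("abcde"))

# Boolean mask: True where a cell of df1 appears anywhere in df2's row
mask = np.isin(df1.to_numpy(), df2.to_numpy()[0])
scores = mask.sum(axis=1)  # matched-column count per row
best = scores.argmax()     # index of the best-matching row
print(scores, best)
```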