I have a pandas DataFrame "d1" with "n" columns and "m" rows, and another DataFrame "d2" with a single row. My task is to find the row in d1 that best matches d2, i.e. the row with the highest count of matching column values.
d2:
a b c d e
0 machine machine learning spark java python
The d1 dataframe looks similar, just with more rows:
a b c d e
0 machine deep learning bigdata java python
1 cnn nlp bigdata spark java
When I compare d1 with d2, the best-matching row is index '0', as it has the most values in common with d2.
As of now, I iterate over the d1 dataframe and compare each row with d2, which is taking a lot of time. Below is the code:
max_score = 0
best_index = None
for index, row in d1.iterrows():
    # count of values this row shares with d2's single row
    score = len(set(row.values.tolist()) & set(d2.values[0].tolist()))
    if score >= cutoff_score and score > max_score:
        max_score = score
        best_index = index
Is there an optimal way to do this comparison, ideally in O(1) or O(log n)?
CodePudding user response:
iterrows is a trap. It's basically never the best option, and it needlessly overcomplicates things by deconstructing a perfectly good DataFrame.
d2_set = set(df2.loc[0])
out = df1.apply(lambda x: len(set(x) & d2_set), axis=1, raw=True)
print(out)
print(out.idxmax())
Output:
0 3
1 2
dtype: int64
0
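For reference, here is the same approach end to end; the data below is reconstructed from the question (assuming "machine learning" and "deep learning" are single cell values, which is what makes the match counts come out as 3 and 2):

```python
import pandas as pd

# Data reconstructed from the question ("machine learning" and
# "deep learning" are assumed to be single cell values)
df2 = pd.DataFrame([["machine", "machine learning", "spark", "java", "python"]],
                   columns=list("abcde"))
df1 = pd.DataFrame([["machine", "deep learning", "bigdata", "java", "python"],
                    ["cnn", "nlp", "bigdata", "spark", "java"]],
                   columns=list("abcde"))

d2_set = set(df2.loc[0])
# raw=True passes each row as a plain ndarray instead of a Series,
# which skips per-row Series construction and speeds up apply
out = df1.apply(lambda x: len(set(x) & d2_set), axis=1, raw=True)
print(out)           # per-row overlap counts: 0 -> 3, 1 -> 2
print(out.idxmax())  # index of the best-matching row: 0
```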
Given 2,000,000 rows:
a b c d e
0 machine deep learning bigdata java python
1 cnn nlp bigdata spark java
2 machine deep learning bigdata java python
3 cnn nlp bigdata spark java
4 machine deep learning bigdata java python
... ... ... ... ... ...
1999995 cnn nlp bigdata spark java
1999996 machine deep learning bigdata java python
1999997 cnn nlp bigdata spark java
1999998 machine deep learning bigdata java python
1999999 cnn nlp bigdata spark java
[2000000 rows x 5 columns]
It only takes about 4 seconds, and max/idxmax then only takes a couple of milliseconds:
>>> t = time();out = df1.apply(lambda x: len(set(x) & d2_set), axis=1, raw=True);time()-t;
4.035001277923584
>>> t = time();out.max();out.idxmax();time()-t;
0.0015015602111816406
For comparison, your iterrows approach took about 60 seconds on the same data.
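If even the apply overhead matters, a fully vectorized sketch with np.isin is also possible. One caveat: it counts matching cells rather than unique matching values, which coincides here but can differ if a row repeats a matching value. The data below is the same reconstruction assumed from the question:

```python
import numpy as np
import pandas as pd

# Same reconstructed data as in the question (assumed cell grouping)
df2 = pd.DataFrame([["machine", "machine learning", "spark", "java", "python"]],
                   columns=list("abcde"))
df1 = pd.DataFrame([["machine", "deep learning", "bigdata", "java", "python"],
                    ["cnn", "nlp", "bigdata", "spark", "java"]],
                   columns=list("abcde"))

# Boolean mask: True where a cell of df1 appears anywhere in df2's row
mask = np.isin(df1.to_numpy(), df2.to_numpy()[0])
scores = mask.sum(axis=1)  # matched-column count per row
best = scores.argmax()     # index of the best-matching row
print(scores, best)
```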