I have two columns. For each row, I want to search the comma-separated phrases in col1 and find which phrase shares the most tokens with the string in col2.
Example:

key   col1                                                        col2
ab    'summer hot, best friend, not possible, apple, let it go'  "let be hot"
cd    'do it better, I am sa'                                    "I need to go"
fg    'my best post, fun sunday'                                 "it's great"
Desired output:

key   col1                                                        col2            match
ab    'summer hot, best friend, not possible, apple, let it go'  "let be hot"    "let it go"
cd    'do it better, I am sa'                                    "I need to go"  "I am sa"
fg    'my best post, fun sunday'                                 "it's great"    None
The first row has a common token with both 'summer hot' and 'let it go', but it doesn't matter which one is selected.
I did the following, but I am looking for a more efficient method:
name_norm = []
for i in range(len(df)):
    name_max_tok = ''
    max_tok = 0
    # compare each comma-separated phrase of col1 against the tokens of col2
    for l in df['col1'].values[i].split(','):
        count = 0
        for k in df['col2'].values[i].split(' '):
            if k in l:
                count += 1
        if max_tok < count:
            max_tok = count
            name_max_tok = l
    name_norm.append(name_max_tok)
CodePudding user response:
import pandas as pd
import numpy as np

df = pd.DataFrame([
    ['ab', 'summer hot, best friend, not possible, apple, let it go', "let be hot"],
    ['cd', 'do it better, I am sa', "I need to go"],
    ['fg', 'my best post, fun sunday', "it's great"],
], columns=['key', 'col1', 'col2'])

# Pre-compute the token set of col2 once per row, so the inner
# membership tests are O(1) set lookups instead of nested loops.
df["col2_set"] = df["col2"].apply(lambda x: set(x.split(" ")))

def setter(x):
    # For each comma-separated phrase in col1, count how many of its
    # tokens appear in the col2 token set.
    data = x.col1.split(",")
    res = np.array([len(x.col2_set.intersection(y.split(" "))) for y in data])
    if res.sum() == 0:
        return None
    return data[res.argmax()]

df['match'] = df.apply(setter, axis=1)
df.drop(columns=['col2_set'], inplace=True)
df
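For reference, the same idea can be written as a standalone function (the name `best_match` and the `.strip()` on each phrase are my additions; stripping avoids the leading-space artifact you get from splitting col1 on commas). Note that on the first sample row it returns 'summer hot' rather than 'let it go', since both tie at one common token and `argmax` keeps the first, which the question says is acceptable:

```python
import numpy as np
import pandas as pd

def best_match(phrases: str, sentence: str):
    """Return the comma-separated phrase from `phrases` that shares the most
    whitespace tokens with `sentence`, or None if there is no overlap."""
    tokens = set(sentence.split())
    candidates = [p.strip() for p in phrases.split(",")]
    counts = np.array([len(tokens & set(p.split())) for p in candidates])
    if counts.sum() == 0:
        return None
    # argmax keeps the first phrase on ties
    return candidates[counts.argmax()]

df = pd.DataFrame([
    ['ab', 'summer hot, best friend, not possible, apple, let it go', "let be hot"],
    ['cd', 'do it better, I am sa', "I need to go"],
    ['fg', 'my best post, fun sunday', "it's great"],
], columns=['key', 'col1', 'col2'])

df['match'] = [best_match(a, b) for a, b in zip(df['col1'], df['col2'])]
```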