Input:
import pandas as pd
df_input = pd.DataFrame({'Keyword': {0: 'apple banana, orange',
1: 'apple orange ?banana "',
2: 'potato, piercing pot hole',
3: 'armor hard known'},
'Returns': {0: 'Fruit; Banana Vendor',
1: 'Blendor :Kaka Orange',
2: 'piercing Fruit Banana takes a lot',
3: 'bullet jacket gun'}})
df_input
For every word in the Keyword column,
- if any of them appears in the Returns column, Score = 1
- if none of them appears in the Returns column, Score = 0
- if any of them appears in the first half of the words in the Returns column, Score_before = 1
- if any of them appears in the second half of the words in the Returns column, Score_after = 1
Output:
import pandas as pd
df_output = pd.DataFrame({'Keyword': {0: 'apple banana, orange',
1: 'apple orange ?banana "',
2: 'potato, piercing pot hole',
3: 'armor hard known'},
'Returns': {0: 'Fruit; Banana Vendor',
1: 'Blendor :Kaka Orange',
2: 'piercing Fruit Banana takes a lot',
3: 'bullet jacket gun'},
'Score': {0: 1, 1: 1, 2: 1, 3: 0},
'Score_before': {0: 0, 1: 0, 2: 1, 3: 0},
'Score_after': {0: 0, 1: 1, 2: 0, 3: 0}})
df_output
The original data frame has a million rows, so how do I even tokenize the words efficiently? Should I use string operations instead?
(I've used from nltk.tokenize import word_tokenize
before, but how do I apply it to the whole data frame, if that's the way to go?)
Edit: My custom function for tokenization:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
import string
def tokenize(s):
    # strip all punctuation, then split into word tokens
    translator = str.maketrans('', '', string.punctuation)
    s = s.translate(translator)
    s = word_tokenize(s)
    return s

tokenize('Finity: Stocks%, Direct$ MF, ETF')
# ['Finity', 'Stocks', 'Direct', 'MF', 'ETF']
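If it helps, this is how I'd apply it to the whole frame for now, just a straight .map per column (the *_tokens column names are placeholders I made up); I doubt this is the efficient way for a million rows:
df_input['Keyword_tokens'] = df_input['Keyword'].map(tokenize)  # tokenize each Keyword string
df_input['Returns_tokens'] = df_input['Returns'].map(tokenize)  # tokenize each Returns string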
CodePudding user response:
As the words to match are different for every row, you need to loop here.
You can use a custom function:
import re

def get_scores(s1, s2):
    # words from Keyword as a set, for cheap membership tests
    words1 = set(re.findall(r'\w+', s1.casefold()))
    # words from Returns, in order, then split into two halves
    words2 = re.findall(r'\w+', s2.casefold())
    n = len(words2)//2
    half1 = words2[:n]
    half2 = words2[n:]
    score_before = any(w in words1 for w in half1)
    score_after = any(w in words1 for w in half2)
    score = score_before or score_after
    return (score, score_before, score_after)

df2 = pd.DataFrame([get_scores(s1, s2) for s1, s2 in
                    zip(df_input['Keyword'], df_input['Returns'])],
                   dtype=int, columns=['Score', 'Score_before', 'Score_after'])
out = df_input.join(df2)
output:
                     Keyword                            Returns  Score  Score_before  Score_after
0       apple banana, orange               Fruit; Banana Vendor      1             0            1
1     apple orange ?banana "               Blendor :Kaka Orange      1             0            1
2  potato, piercing pot hole  piercing Fruit Banana takes a lot      1             1            0
3           armor hard known                  bullet jacket gun      0             0            0
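If you prefer to keep everything in pandas, the same function can be wired through DataFrame.apply; this is just an equivalent sketch, and it is usually no faster than the zip-based list comprehension, since the Python-level loop is still there:
# row-wise apply; result_type='expand' turns the returned tuple into three columns
df2 = df_input.apply(lambda r: get_scores(r['Keyword'], r['Returns']),
                     axis=1, result_type='expand')
df2.columns = ['Score', 'Score_before', 'Score_after']
out = df_input.join(df2.astype(int))  # cast the booleans to 0/1
Either way, converting the Keyword words to a set is what keeps the membership tests cheap; the loop over rows itself is unavoidable, because the words to match differ on every row.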