Search for similar words add the matching words to the new column-CodePudding

I would like to take both columns like col_one and similar_words. to be used to search for similar words You'll see that in col_one, for example, WPH has words like similar_words is [WP, W, WPH], if found, add the matching words to the new column.

Most of the time, the name will be at the front, such as ZEN, W, WICE, but there will be some names. with a symbol after the name again, such as ZEN-R, ZEN-W2, ZEN13P2302A

my data

import pandas as pd

list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG',  'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']

data = {
  "col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK", "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG", "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)

df['similar_words'] = df['col_one'].apply(lambda x: [c for c in list20 if c in x])
df

will look like this picture

Expected Output

CodePudding user response：

It looks like you want to extract the longest match from your list of words. You can achieve this with a regex in a single step by carefully designing the regex to match longer words first:

import re
regex = '|'.join(map(re.escape, sorted(list20, key=len, reverse=True)))

df['same_word'] = df['col_one'].str.extract(f'({regex})')

Output:

        col_one     similar_words same_word
0           ZEN             [ZEN]       ZEN
1           WPH      [WP, W, WPH]       WPH
2          WICE         [WICE, W]      WICE
3          YONG            [YONG]      YONG
4             K               [K]         K
5            XO              [XO]        XO
6           WIN          [W, WIN]       WIN
7            WP           [WP, W]        WP
8          WIIK      [K, W, WIIK]      WIIK
9        YGG-W1          [YGG, W]       YGG
10         W-W5               [W]         W
11       WINNER  [W, WINNER, WIN]    WINNER
12        YUASA           [YUASA]     YUASA
13          WGE          [WGE, W]       WGE
14          WFX          [W, WFX]       WFX
15          XPG             [XPG]       XPG
16        WHAUP   [W, WHA, WHAUP]     WHAUP
17          WHA          [W, WHA]       WHA
18  KCE13P2302A          [K, KCE]       KCE
19        OOP-R             [OOP]       OOP

CodePudding user response：

Iterate though similar_words array and find the length of matching words:

df["match_len"] = df.apply(lambda row: [len(sw) if sw in row["col_one"] else 0 for sw in row["similar_words"]], axis=1)

>>        col_one     similar_words  match_len
>> 0           ZEN             [ZEN]        [3]
>> 1           WPH      [WP, W, WPH]  [2, 1, 3]
>> 2          WICE         [WICE, W]     [4, 1]

Then find the index of the word with max length and return the word at that index:

df["same_word"] = df.apply(lambda row: row["similar_words"][row["match_len"].index(max(row["match_len"]))], axis=1)

>>         col_one     similar_words  match_len same_word
>> 0           ZEN             [ZEN]        [3]       ZEN
>> 1           WPH      [WP, W, WPH]  [2, 1, 3]       WPH
>> 2          WICE         [WICE, W]     [4, 1]      WICE