I would like to take both columns like col_one and similar_words. to be used to search for similar words You'll see that in col_one, for example, WPH has words like similar_words is [WP, W, WPH], if found, add the matching words to the new column.
Most of the time, the name will be at the front, such as ZEN, W, WICE, but there will be some names. with a symbol after the name again, such as ZEN-R, ZEN-W2, ZEN13P2302A
my data
import pandas as pd
list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']
data = {
"col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK", "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG", "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)
df['similar_words'] = df['col_one'].apply(lambda x: [c for c in list20 if c in x])
df
will look like this picture
Expected Output
CodePudding user response:
It looks like you want to extract the longest match from your list of words. You can achieve this with a regex in a single step by carefully designing the regex to match longer words first:
import re
regex = '|'.join(map(re.escape, sorted(list20, key=len, reverse=True)))
df['same_word'] = df['col_one'].str.extract(f'({regex})')
Output:
col_one similar_words same_word
0 ZEN [ZEN] ZEN
1 WPH [WP, W, WPH] WPH
2 WICE [WICE, W] WICE
3 YONG [YONG] YONG
4 K [K] K
5 XO [XO] XO
6 WIN [W, WIN] WIN
7 WP [WP, W] WP
8 WIIK [K, W, WIIK] WIIK
9 YGG-W1 [YGG, W] YGG
10 W-W5 [W] W
11 WINNER [W, WINNER, WIN] WINNER
12 YUASA [YUASA] YUASA
13 WGE [WGE, W] WGE
14 WFX [W, WFX] WFX
15 XPG [XPG] XPG
16 WHAUP [W, WHA, WHAUP] WHAUP
17 WHA [W, WHA] WHA
18 KCE13P2302A [K, KCE] KCE
19 OOP-R [OOP] OOP
CodePudding user response:
Iterate though similar_words
array and find the length of matching words:
df["match_len"] = df.apply(lambda row: [len(sw) if sw in row["col_one"] else 0 for sw in row["similar_words"]], axis=1)
>> col_one similar_words match_len
>> 0 ZEN [ZEN] [3]
>> 1 WPH [WP, W, WPH] [2, 1, 3]
>> 2 WICE [WICE, W] [4, 1]
Then find the index of the word with max length and return the word at that index:
df["same_word"] = df.apply(lambda row: row["similar_words"][row["match_len"].index(max(row["match_len"]))], axis=1)
>> col_one similar_words match_len same_word
>> 0 ZEN [ZEN] [3] ZEN
>> 1 WPH [WP, W, WPH] [2, 1, 3] WPH
>> 2 WICE [WICE, W] [4, 1] WICE