We have a given list:
list_of_versions = ['apple II' ,'apple', 'apple 1' , 'HD APPLE','apple 3.5', 'adventures of apple' , 'apple III','orange 2' ,'300mhz apple', '300-orange II' , 'orange II HD' , 'orange II tvx', 'orange 2' , 'HD berry-vol 2', 'berry II', 'berry 2', 'berry VI', 'berry 1', 'berry II' ,'berry' ,'II Berry']
How can I find the main word of each string? For example:
word | main
--------------------------
apple II |apple
val/apple |apple
apple 1 |apple
HD APPLE |apple
apple 3.5 |apple
adventures of apple |apple
apple III |apple
300mhz apple |apple
orange 2 |orange
300-orange II |orange
orange II HD |orange
/orange/II-tvx |orange
orange 2 |orange
HD berry-vol 2 |berry
berry-II |berry
-berry-2 |berry
(berry) VI |berry
berry 1 |berry
berry II |berry
berry 2022 B8 |berry
II Berry-hd |berry
22 Berry II |berry
Berry 6.8.9 |berry
Important points:
I cannot maintain a fixed list of main words (apple, orange, berry), because the data will keep being updated with new main words, so we never know in advance what the new words will be.
There is no limit on versions. At some point we may see something like 'apple XII' or 'GB-HD berry 2.4', so the version value can be anything (in case you were thinking of creating a stopword list).
Nice to have (not mandatory): another column that shows the version, e.g.:
word | main | version
-----------------------------------
apple II |apple | II
val/apple |apple | NULL
apple 1 |apple | 1
HD APPLE |apple | HD
apple 3.5 |apple | 3.5
apple III |apple | III
300mhz apple II |apple | II
orange 2 |orange | 2
300-orange II |orange | II
orange II HD |orange | II HD
/orange/II-tvx |orange | II tvx
orange 2 |orange | 2
HD berry-vol 2 |berry | 2 HD
berry-II |berry | II
-berry-2 |berry | 2
(berry) VI |berry | VI
berry 1 |berry | 1
berry II |berry | II
berry 2022 |berry | NULL
II Berry-hd |berry | II HD
22 Berry |berry | 22
Berry 6.8.9 |berry | 6.8.9
CodePudding user response:
I propose the following heuristic for your task: take the longest sequence of letters in each string. It can be implemented with the re module as follows:
import re
list_of_versions = ['apple II' ,'apple', 'apple 1' , 'HD APPLE','apple 3.5', 'apple III','orange 2' ,'300mhz apple', '300-orange II' , 'orange II HD' , 'orange II tvx', 'orange 2' , 'HD berry-vol 2', 'berry II', 'berry 2', 'berry VI', 'berry 1', 'berry II' ,'berry' ,'II Berry']
def get_main(string):
    # longest run of ASCII letters in the string
    return max(re.findall(r'[A-Za-z]+', string), key=len)
for version in list_of_versions:
    print(version, '|', get_main(version))
output
apple II | apple
apple | apple
apple 1 | apple
HD APPLE | APPLE
apple 3.5 | apple
apple III | apple
orange 2 | orange
300mhz apple | apple
300-orange II | orange
orange II HD | orange
orange II tvx | orange
orange 2 | orange
HD berry-vol 2 | berry
berry II | berry
berry 2 | berry
berry VI | berry
berry 1 | berry
berry II | berry
berry | berry
II Berry | Berry
Warning: this solution is limited to ASCII letters and was prepared using only your example data. Please test it against all the data you have access to, to check whether it returns what you want often enough for your use case.
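Regarding the ASCII limitation, here is a minimal sketch of one possible variation that accepts any Unicode letter and casefolds the result. The name get_main_unicode and the sample strings are made up for illustration, and returning None for strings without letters is an assumption about the desired behaviour.

```python
import re

# [^\W\d_]+ matches runs of Unicode letters:
# word characters minus digits and the underscore.
LETTERS = re.compile(r'[^\W\d_]+')

def get_main_unicode(string):
    """Return the longest run of letters, casefolded, or None if there are no letters."""
    runs = LETTERS.findall(string)
    if not runs:
        return None
    return max(runs, key=len).casefold()

for sample in ['HD APPLE', '300mhz apple', 'Äpfel 2', '2022']:
    print(sample, '|', get_main_unicode(sample))
```

The None branch also avoids the ValueError that max raises on an empty list when a string contains no letters at all.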
CodePudding user response:
As suggested in the comment, you can get the longest string:
df['main'] = (df['words']
              .str.extractall('([a-zA-Z]+)')
              .sort_values(by=0, key=lambda x: x.str.len())
              .groupby(level=0).last()
              [0].str.lower()  # optional
             )
output:
words main
0 apple II apple
1 apple apple
2 apple 1 apple
3 HD APPLE apple
4 apple 3.5 apple
5 apple III apple
6 orange 2 orange
7 300mhz apple apple
8 300-orange II orange
9 orange II HD orange
10 orange II tvx orange
11 orange 2 orange
12 HD berry-vol 2 berry
13 berry II berry
14 berry 2 berry
15 berry VI berry
16 berry 1 berry
17 berry II berry
18 berry berry
19 II Berry berry
attempt for the "version": keeping all other words
option 1
g = (df['words']
     .str.extractall(r'\b([a-zA-Z]+)\b')
     .sort_values(by=0, key=lambda x: x.str.len())
     .droplevel(1)
     .groupby(level=0, group_keys=False)[0]
    )
# the longest word is the main word, the remaining words form the version
df['main'] = g.last().str.lower()
df['version'] = g.apply(lambda x: ' '.join(x.iloc[:-1]))
output:
words main version
0 apple II apple II
1 apple apple
2 apple 1 apple
3 HD APPLE apple HD
4 apple 3.5 apple
5 apple III apple III
6 orange 2 orange
7 300mhz apple apple
8 300-orange II orange II
9 orange II HD orange HD II
10 orange II tvx orange II tvx
11 orange 2 orange
12 HD berry-vol 2 berry HD vol
13 berry II berry II
14 berry 2 berry
15 berry VI berry VI
16 berry 1 berry
17 berry II berry II
18 berry berry
19 II Berry berry II
option 2 (different regex and length computation)
g = (df['words']
     .str.extractall(r'(\b\w+\b)')
     # rank tokens by their number of letters only, so digits do not count
     .sort_values(by=0, key=lambda x: x.str.replace('[^a-zA-Z]', '', regex=True)
                                       .str.len())
     .droplevel(1)
     .groupby(level=0, group_keys=False)[0]
    )
df['main'] = g.last().str.lower()
df['version'] = g.apply(lambda x: ' '.join(x.iloc[:-1]))
output:
words main version
0 apple II apple II
1 apple apple
2 apple 1 apple 1
3 HD APPLE apple HD
4 apple 3.5 apple 3 5
5 apple III apple III
6 orange 2 orange 2
7 300mhz apple apple 300mhz
8 300-orange II orange 300 II
9 orange II HD orange HD II
10 orange II tvx orange II tvx
11 orange 2 orange 2
12 HD berry-vol 2 berry 2 HD vol
13 berry II berry II
14 berry 2 berry 2
15 berry VI berry VI
16 berry 1 berry 1
17 berry II berry II
18 berry berry
19 II Berry berry II
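Because both options sort the tokens by length, the version tokens come out reordered (for example 'HD II' for 'orange II HD'). If the original order matters, a plain-Python sketch of the same idea as option 2 could look like the following. The function name split_main_version is hypothetical, and treating strings without any letters as having no main word is an assumption.

```python
import re

def letter_count(token):
    """Number of ASCII letters in a token, mirroring the length computation of option 2."""
    return len(re.sub(r'[^a-zA-Z]', '', token))

def split_main_version(text):
    """Return (main, version): main is the token with the most letters,
    version is the remaining tokens joined in their original order."""
    tokens = re.findall(r'\w+', text)
    if not tokens:
        return None, None
    main_index = max(range(len(tokens)), key=lambda i: letter_count(tokens[i]))
    if letter_count(tokens[main_index]) == 0:
        # No letters anywhere: assume there is no main word.
        return None, ' '.join(tokens)
    rest = tokens[:main_index] + tokens[main_index + 1:]
    return tokens[main_index].lower(), ' '.join(rest) or None

for text in ['orange II HD', 'HD berry-vol 2', '300mhz apple', 'apple']:
    print(text, '|', *split_main_version(text))
```

As in option 2, 'vol' still ends up in the version and '3.5' would be split into '3 5', so this only changes the ordering, not the tokenization.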
CodePudding user response:
All the other answers omit the entry containing the word "adventures" because it throws off the search. You need a heuristic that can combine "longest" with "most frequent".
One thing that helps is that finding the longest word in each row greatly increases the signal-to-noise ratio. In other words, it filters out the unnecessary words pretty well and just needs a little help. If you know how many words you are looking for (three in this case), you're all set:
import re
from collections import Counter
common_long_words = [word.casefold() for word in (max(re.findall(r'\w+', version), key=len) for version in list_of_versions)]
# most_common returns (word, count) pairs; keep only the words
words = [word for word, _ in Counter(common_long_words).most_common(3)]
Splitting off the version and finding the word of interest is not especially difficult. You have a couple of options regarding what constitutes a version, especially when the main word is embedded in the middle of the phrase. Here is a simple function that takes the entire remainder:
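def split_main(version, words):
    # search case-insensitively, since the main words were casefolded
    folded = version.casefold()
    for word in words:
        i = folded.find(word)
        if i >= 0:  # find returns -1 when absent and 0 when the word starts the string
            return word, f'{version[:i]} {version[i + len(word):]}'
    else:
        raise ValueError(f'Version "{version}" does not contain any of the main words {{{", ".join(words)}}}')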
result = {version: split_main(version, words) for version in list_of_versions}
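The question says the number of main words is not known in advance. One possible variation, sketched here under that assumption, is to replace the fixed most_common(3) with a frequency threshold: keep every word that is the longest word in at least min_count rows. The names find_main_words and pick_main, the threshold of 2, and the shortened sample list are all illustrative choices, not part of the approach above.

```python
import re
from collections import Counter

def find_main_words(phrases, min_count=2):
    """Collect the longest letter run of each phrase and keep those
    that win in at least min_count phrases."""
    longest = []
    for phrase in phrases:
        runs = re.findall(r'[A-Za-z]+', phrase)
        if runs:
            longest.append(max(runs, key=len).casefold())
    counts = Counter(longest)
    return {word for word, count in counts.items() if count >= min_count}

def pick_main(phrase, main_words):
    """Return the longest letter run of the phrase that is a known main word, else None."""
    runs = sorted(re.findall(r'[A-Za-z]+', phrase.casefold()), key=len, reverse=True)
    for run in runs:
        if run in main_words:
            return run
    return None

phrases = ['apple II', 'apple', 'adventures of apple', 'orange 2',
           '300-orange II', 'HD berry-vol 2', 'berry VI', 'II Berry']
main_words = find_main_words(phrases)
for phrase in phrases:
    print(phrase, '|', pick_main(phrase, main_words))
```

A main word that appears in only one row falls below the threshold and yields None, so the threshold trades coverage of rare words for robustness against one-off long words such as "adventures".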