Find most common words from list of strings

We have the following list:

list_of_versions = ['apple II', 'apple', 'apple 1', 'HD APPLE', 'apple 3.5', 'adventures of apple', 'apple III', 'orange 2', '300mhz apple', '300-orange II', 'orange II HD', 'orange II tvx', 'orange 2', 'HD berry-vol 2', 'berry II', 'berry 2', 'berry VI', 'berry 1', 'berry II', 'berry', 'II Berry']

How can I find the main word of each string? For example:

word            | main
--------------------------
apple II        |apple
val/apple       |apple
apple 1         |apple
HD APPLE        |apple
apple 3.5       |apple
adventures of apple |apple
apple III       |apple
300mhz apple    |apple
orange 2        |orange
300-orange II   |orange
orange II HD    |orange
/orange/II-tvx  |orange
orange 2        |orange
HD berry-vol 2  |berry
berry-II        |berry
-berry-2        |berry
(berry) VI      |berry
berry 1         |berry
berry II        |berry
berry 2022 B8   |berry
II Berry-hd     |berry
22 Berry II     |berry
Berry 6.8.9     |berry

Important points:

I cannot create a list of the main words in advance (here apple, orange, berry), because the data will be updated with new main words and we will never know what the new words are.
There is no limit on versions. At some point we may see something like 'apple XII' or 'GB-HD berry 2.4', so the version can be anything (in case you were thinking of creating a stopword list).

Nice to have (not mandatory): an additional column that shows the version, i.e.:

word            | main   | version
-----------------------------------
apple II        |apple   | II 
val/apple       |apple   | NULL
apple 1         |apple   | 1
HD APPLE        |apple   | HD
apple 3.5       |apple   | 3.5
apple III       |apple   | III
300mhz apple II |apple   | II
orange 2        |orange  | 2
300-orange II   |orange  | II
orange II HD    |orange  | II HD
/orange/II-tvx  |orange  | II tvx
orange 2        |orange  | 2
HD berry-vol 2  |berry   | 2 HD
berry-II        |berry   | II
-berry-2        |berry   | 2
(berry) VI      |berry   | VI
berry 1         |berry   | 1
berry II        |berry   | II  
berry 2022      |berry   | NULL
II Berry-hd     |berry   | II HD
22 Berry        |berry   | 22
Berry 6.8.9     |berry   | 6.8.9

CodePudding user response：

I propose the following heuristic for your task: find the longest sequence of letters in each string. It can be implemented with the re module as follows:

import re

list_of_versions = ['apple II', 'apple', 'apple 1', 'HD APPLE', 'apple 3.5', 'apple III', 'orange 2', '300mhz apple', '300-orange II', 'orange II HD', 'orange II tvx', 'orange 2', 'HD berry-vol 2', 'berry II', 'berry 2', 'berry VI', 'berry 1', 'berry II', 'berry', 'II Berry']

def get_main(string):
    # Longest run of ASCII letters in the string
    return max(re.findall(r'[A-Za-z]+', string), key=len)

for version in list_of_versions:
    print(version, '|', get_main(version))

output

apple II | apple
apple | apple
apple 1 | apple
HD APPLE | APPLE
apple 3.5 | apple
apple III | apple
orange 2 | orange
300mhz apple | apple
300-orange II | orange
orange II HD | orange
orange II tvx | orange
orange 2 | orange
HD berry-vol 2 | berry
berry II | berry
berry 2 | berry
berry VI | berry
berry 1 | berry
berry II | berry
berry | berry
II Berry | Berry

Warning: this solution is limited to ASCII letters and was prepared using only your example data. Please test it on all the data you have access to, to check whether it returns what you want often enough for your use case.
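
The ASCII limitation mentioned in the warning could be relaxed along the following lines. This is only a sketch, not part of the original answer: get_main_unicode is a hypothetical name, and the sketch assumes that "letter" should mean any Unicode letter and that the result should be lower-cased so that 'APPLE' and 'apple' compare equal.

```python
import re

# [^\W\d_] matches any word character that is neither a digit nor an
# underscore, i.e. a letter in any script (str patterns are Unicode-aware).
LETTERS = re.compile(r'[^\W\d_]+')

def get_main_unicode(string):
    """Hypothetical variant of get_main: longest run of letters, case-folded."""
    runs = LETTERS.findall(string)
    # default=None avoids a ValueError when the string contains no letters
    longest = max(runs, key=len, default=None)
    return longest.casefold() if longest is not None else None

for sample in ['HD APPLE', 'II Berry', 'HD ÄPFEL 2', '2022']:
    print(sample, '|', get_main_unicode(sample))
```

Returning None for strings without any letters is a design choice made for this sketch; raising an error may suit your data better.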

CodePudding user response：

As suggested in the comments, you can take the longest run of letters in each string:

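import pandas as pd

# Assumes list_of_versions holds the input strings
df = pd.DataFrame({'words': list_of_versions})

df['main'] = (df['words']
 .str.extractall('([a-zA-Z]+)')                  # one row per run of letters
 .sort_values(by=0, key=lambda x: x.str.len())   # shortest first, longest last
 .groupby(level=0).last()                        # keep the longest per original row
 [0].str.lower() # optional
)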

output:

             words    main
0         apple II   apple
1            apple   apple
2          apple 1   apple
3         HD APPLE   apple
4        apple 3.5   apple
5        apple III   apple
6         orange 2  orange
7     300mhz apple   apple
8    300-orange II  orange
9     orange II HD  orange
10   orange II tvx  orange
11        orange 2  orange
12  HD berry-vol 2   berry
13        berry II   berry
14         berry 2   berry
15        berry VI   berry
16         berry 1   berry
17        berry II   berry
18           berry   berry
19        II Berry   berry

Attempt at the "version" column: keep all the other words.

Option 1

g = (df['words']
     .str.extractall(r'\b([a-zA-Z]+)\b')
     .sort_values(by=0, key=lambda x: x.str.len())
     .droplevel(1)
     .groupby(level=0, group_keys=False)[0]
    )

df['main'] = g.last().str.lower()
df['version'] = g.apply(lambda x: ' '.join(x.iloc[:-1]))

output:

             words    main version
0         apple II   apple      II
1            apple   apple        
2          apple 1   apple        
3         HD APPLE   apple      HD
4        apple 3.5   apple        
5        apple III   apple     III
6         orange 2  orange        
7     300mhz apple   apple        
8    300-orange II  orange      II
9     orange II HD  orange   HD II
10   orange II tvx  orange  II tvx
11        orange 2  orange        
12  HD berry-vol 2   berry  HD vol
13        berry II   berry      II
14         berry 2   berry        
15        berry VI   berry      VI
16         berry 1   berry        
17        berry II   berry      II
18           berry   berry        
19        II Berry   berry      II

Option 2 (different regex and length computation)

g = (df['words']
     .str.extractall(r'(\b\w+\b)')
     .sort_values(by=0, key=lambda x: x.str.replace('[^a-zA-Z]', '', regex=True)
                                       .str.len())
     .droplevel(1)
     .groupby(level=0, group_keys=False)[0]
    )

df['main'] = g.last().str.lower()
df['version'] = g.apply(lambda x: ' '.join(x.iloc[:-1]))

output:

             words    main   version
0         apple II   apple        II
1            apple   apple          
2          apple 1   apple         1
3         HD APPLE   apple        HD
4        apple 3.5   apple       3 5
5        apple III   apple       III
6         orange 2  orange         2
7     300mhz apple   apple    300mhz
8    300-orange II  orange    300 II
9     orange II HD  orange     HD II
10   orange II tvx  orange    II tvx
11        orange 2  orange         2
12  HD berry-vol 2   berry  2 HD vol
13        berry II   berry        II
14         berry 2   berry         2
15        berry VI   berry        VI
16         berry 1   berry         1
17        berry II   berry        II
18           berry   berry          
19        II Berry   berry        II
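
Both options return the version words sorted by length, and option 2 splits '3.5' into '3 5'. If the original word order and dotted numbers matter, a plain-Python sketch like the one below might help. It is not part of the original answer: split_version and TOKEN are hypothetical names, and the token pattern (runs of ASCII letters, or numbers with optional dotted parts) is an assumption about what your data looks like.

```python
import re

# Assumption: a token is either a run of ASCII letters or a number with
# optional dotted parts such as "3.5" or "6.8.9".
TOKEN = re.compile(r'[A-Za-z]+|\d+(?:\.\d+)*')

def split_version(text):
    """Return (main, version) for one string; a hypothetical helper."""
    tokens = TOKEN.findall(text)
    words = [t for t in tokens if t.isalpha()]
    if not words:
        # No alphabetic token at all: there is no main word to report
        return None, ' '.join(tokens)
    main = max(words, key=len)   # same "longest word" heuristic as above
    rest = list(tokens)
    rest.remove(main)            # drop only the first occurrence of the main word
    return main.lower(), ' '.join(rest)

for sample in ['apple 3.5', 'orange II HD', 'HD berry-vol 2', 'Berry 6.8.9']:
    print(sample, '|', split_version(sample))
```

With the DataFrame used in this answer, something like df['words'].apply(split_version) should give one (main, version) pair per row, which can then be split into two columns.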

CodePudding user response：

All the other answers omit the entry containing the word "adventures" because it throws off the search. You need a heuristic that can combine "longest" with "most frequent".

One thing that helps is that finding the longest word in each row greatly increases the signal-to-noise ratio. In other words, it filters out the unnecessary words pretty well, and just needs a little help. If you know how many main words you are looking for (three in this case), you're all set:

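import re
from collections import Counter

# Longest word of each string, case-folded so that counting ignores case
common_long_words = [word.casefold() for word in (max(re.findall(r'\w+', version), key=len) for version in list_of_versions)]
# most_common() returns (word, count) pairs; keep only the words
words = [word for word, _ in Counter(common_long_words).most_common(3)]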

Splitting off the version and finding the word of interest is not especially difficult. You have a couple of options regarding what constitutes a version, especially when the main word is embedded in the middle of the phrase. Here is a simple function that takes the entire remainder:

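def split_main(version, words):
    # Search case-insensitively so that 'HD APPLE' matches 'apple'
    lowered = version.casefold()
    for word in words:
        i = lowered.find(word)
        if i >= 0:  # find() returns -1 when absent; 0 is a valid match position
            # Everything before and after the main word is treated as the version
            return word, f'{version[:i]} {version[i + len(word):]}'.strip()
    else:
        raise ValueError(f'Version "{version}" does not contain any of the main words {{{", ".join(words)}}}')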

result = {version: split_main(version, words) for version in list_of_versions}
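
If the number of main words is not known in advance, "longest" and "most frequent" could be combined into a single score instead. The sketch below is one possible way to do that and is not part of the original answer: letter_runs and guess_main_words are hypothetical names, and scoring each word by (number of strings containing it) x (its length) is an assumption that happens to work on the sample list, not a general guarantee.

```python
import re
from collections import Counter

def letter_runs(text):
    """Case-folded runs of ASCII letters in one string."""
    return [w.casefold() for w in re.findall(r'[A-Za-z]+', text)]

def guess_main_words(strings):
    """Map each string to its highest-scoring word (hypothetical heuristic)."""
    # Document frequency: count each word at most once per string
    freq = Counter(tok for s in strings for tok in set(letter_runs(s)))
    result = {}
    for s in strings:
        runs = letter_runs(s)
        # Score = frequency x length, so short frequent tokens like "ii"
        # and long rare ones like "adventures" both lose to the main word
        result[s] = max(runs, key=lambda t: freq[t] * len(t)) if runs else None
    return result

sample = ['apple II', 'apple', 'apple 1', 'HD APPLE', 'apple 3.5',
          'adventures of apple', 'apple III', 'orange 2', '300mhz apple',
          '300-orange II', 'orange II HD', 'orange II tvx', 'orange 2',
          'HD berry-vol 2', 'berry II', 'berry 2', 'berry VI', 'berry 1',
          'berry II', 'berry', 'II Berry']

mains = guess_main_words(sample)
print(mains['adventures of apple'], mains['300-orange II'], mains['II Berry'])
```

A main word that occurs only once in the data would compete on length alone, so results for newly introduced words should be checked by hand.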