Searching for a word/phrase in a string with all the possible approximations of the phrase

Time:11-17

Suppose I have the following string:

string = 'machine learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'

Further suppose that I have a tag defined as:

tag = 'machine learning'

Now I want to find the tag in my string. The string contains three approximate occurrences of machine learning: machine learning at the beginning, machine12 learning in the middle, and machines learning near the end. I want to find all of them and produce the output list

['machine learning', 'machine12 learning', 'machines learning']

To do this, I tried to tokenize my tag using nltk:

tag_token = nltk.word_tokenize(tag)

This gives ['machine', 'learning']. I would then search for tag_token[0].

I know that string.find(tag_token[0]) and string.rfind(tag_token[0]) give the positions of the first and last occurrences of machine, but what if the text contains more occurrences of machine learning (here there are 3)?

In that case I could not extract them all, so my original idea of finding every occurrence of machine and then learning fails. My plan was then to use fuzzywuzzy to compare each of ['machine learning', 'machine12 learning', 'machines learning'] against the tag.
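For reference, here is a minimal sketch of the two steps described above: locating every occurrence of the first token (not only the first and last), and scoring candidate phrases against the tag. It assumes a shortened sample text, and it swaps in the standard-library difflib.SequenceMatcher as a stand-in for fuzzywuzzy, so the scores will not be identical to fuzzywuzzy's. The names text, starts, similarity, and scores are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Shortened sample text (an assumption, not the full string from the question).
text = ('machine learning ml is a type of artificial intelligence '
        'machine12 learning algorithms use historical data '
        'machines learning is good')
tag = 'machine learning'

# Start offset of every occurrence of the first token, not just first/last.
first_token = tag.split()[0]
starts = [m.start() for m in re.finditer(re.escape(first_token), text)]

def similarity(candidate, reference):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, candidate, reference).ratio()

# Score each candidate phrase against the tag.
candidates = ['machine learning', 'machine12 learning', 'machines learning']
scores = {c: similarity(c, tag) for c in candidates}
```

re.finditer only solves the "find all positions" part; extracting the full approximate phrase still needs a pattern that tolerates the extra characters.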

So my question is: given the string I have, how can I search for the tag and its approximations and list them as follows?

['machine learning', 'machine12 learning', 'machines learning']

Update: I now know that I can do the following:

pattern = re.compile(r"(machine[\s0-9]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
#[output]: ['machine learning', 'machine12 learning']

Also, if I do

pattern = re.compile(r"(machine[\sA-Za-z]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
#[output]: ['machine learning', 'machines learning']

But certainly, this is not a generalizable solution as it stands. So I wonder if there is a smart way to search in such scenarios?

CodePudding user response:

Maybe use a pattern like (word\w*), so that each word in the tag can be followed by extra characters?

import re

string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'

tag_token=['machine','learning']

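# Each word may carry trailing characters (\w*); between words allow
# whitespace and at most one extra word.
gap = r'\w*\s+(?:\S*\s+)?'
pattern = '(' + gap.join(tag_token) + r'\w*)'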

rgx=re.compile(pattern,re.IGNORECASE)
rgx.findall(string)
#output
#['machine 12 learning', 'machine12 learning', 'machines learning']
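The same idea can be wrapped in a small helper so it works for any list of tag words. This is only a sketch: the function name build_tag_pattern and the shortened sample text are hypothetical, and re.escape is added on the assumption that tag words might contain regex metacharacters.

```python
import re

def build_tag_pattern(tag_tokens):
    """Compile a regex that matches the tokens in order.

    Each token may be followed by extra word characters, and at most one
    extra whitespace-delimited word may sit between consecutive tokens.
    """
    gap = r'\w*\s+(?:\S*\s+)?'
    body = gap.join(re.escape(token) for token in tag_tokens)
    return re.compile('(' + body + r'\w*)', re.IGNORECASE)

# Shortened sample text (an assumption, not the full string above).
text = ('machine 12 learning is a type of ai '
        'machine12 learning algorithms use data '
        'machines learning is good')

matches = build_tag_pattern(['machine', 'learning']).findall(text)
# matches == ['machine 12 learning', 'machine12 learning', 'machines learning']
```

Note that the pattern is not anchored at a word boundary, so it would also match the tail of a longer word such as submachine; prefixing the body with \b is one way to tighten it if that matters.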

Finding matches becomes harder when the words of the tag can appear in a different order.

The code below handles that case by matching every ordering of the words in tag_token. For example, it matches machine s learning, machine learning, and machine12 12 learning. The same approach works with a different string and a tag_token list that contains more than two words: every ordering of those words is matched. For instance, tag_token = ['1', '2', '3'] matches 1 2 3, 1a 2 b 3, 2b2 1sss 3, and 333 2tt 1.

import re
import itertools

string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good. Learning machine can be used to train people. learning the machines is a great job'

tag_token=['machine','learning']

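# Same gap as before: trailing characters, whitespace, at most one extra word.
gap = r'\w*\s+(?:\S*\s+)?'
# One alternative per ordering of the tag words.
alternatives = [gap.join(current_tag) + r'\w*' for current_tag in itertools.permutations(tag_token)]
pattern = '(' + '|'.join(alternatives) + ')'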
rgx=re.compile(pattern,re.IGNORECASE)

rgx.findall(string)

#output
#['machine 12 learning', 'machine12 learning', 'machines learning', 'Learning machine', 'learning the machines']
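The three-word example mentioned earlier can be checked with a short sketch. The helper name build_any_order_pattern is hypothetical, and re.escape is again an added assumption. Keep in mind that the number of alternatives grows as n! for n tag words, so this approach is only practical for short tags.

```python
import itertools
import re

def build_any_order_pattern(tag_tokens):
    """Compile a regex with one alternative per ordering of the tokens."""
    gap = r'\w*\s+(?:\S*\s+)?'
    alternatives = [
        gap.join(re.escape(token) for token in ordering) + r'\w*'
        for ordering in itertools.permutations(tag_tokens)
    ]
    return re.compile('(' + '|'.join(alternatives) + ')', re.IGNORECASE)

rgx = build_any_order_pattern(['1', '2', '3'])

# The four example strings from the text; each should match in full.
examples = ['1 2 3', '1a 2 b 3', '2b2 1sss 3', '333 2tt 1']
results = {s: bool(rgx.fullmatch(s)) for s in examples}

# 3 tokens give 3! = 6 orderings, hence 6 alternatives in the pattern.
n_alternatives = rgx.pattern.count('|') + 1
```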