For instance, given the lists article_list and disease_list, defined as follows:
article_list = ['Tricuspid', 'valve', 'regurgitation', 'and', 'lithium', 'carbonate',
                'toxicity', 'in', 'a', 'newborn', 'infant', '.']
disease_list = ['Tricuspid valve regurgitation', 'lithium carbonate', 'toxicity']
As you can see, article_list is a list of tokens from an article. However, I would like to tokenize disease names (which often consist of several words) as single tokens instead of several, i.e. replace the corresponding runs of tokens in article_list with the entries of disease_list in the right places.
The function should look something like this:
def insert_disease_tokens(article_list, disease_list):
    # do something

print(insert_disease_tokens(article_list, disease_list))
>>> ['Tricuspid valve regurgitation', 'and', 'lithium carbonate', 'toxicity', 'in', 'a', 'newborn', 'infant', '.']
Any ideas for an efficient way to do this in Python? I only came up with unnecessarily complicated solutions.
CodePudding user response:
You can temporarily replace all the spaces in both the text and the disease names with a placeholder, then control how the placeholders get replaced back (in normal text the placeholder stays, while within an n-gram to be merged it returns to a space), and finally split back to the original list format on that placeholder. Pad the start and end with the placeholder to handle sentences that begin or end with a term.
As the placeholder you can use '_' as @Ritwick Jha suggested, or '__' as in the example below, to be a bit more confident it won't occur naturally in texts or terms.
import re

def insert_disease_tokens(article_list_, disease_list_, ph='__'):
    # join into one string using placeholders and pad with placeholders
    article_list_ = ph + ph.join(article_list_) + ph
    # for each disease: if its placeholder-joined version occurs in the text,
    # replace it with the space-separated version (re.escape guards against
    # regex metacharacters in the terms, e.g. a '.')
    for d in disease_list_:
        article_list_ = re.sub(re.escape(ph + d.replace(' ', ph) + ph), ph + d + ph, article_list_)
    # split on the placeholder, so everything gets split except the
    # space-separated terms found above; also remove the padding again
    return article_list_[len(ph):-len(ph)].split(ph)
article_list = insert_disease_tokens(article_list, disease_list)
Which returns:
> ['Tricuspid valve regurgitation', 'and', 'lithium carbonate', 'toxicity', 'in', 'a', 'newborn', 'infant', '.']
You can of course use a completely different approach, since this is not really a string-specific problem but a general one (how to replace specific sub-sequences in a list); for large data that will probably be much more efficient, but the indexing can get pretty complicated. Check out e.g. this thread; a rough sketch of the idea follows below.
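As a minimal sketch of that list-based idea (the function name merge_subsequences and its structure are my own illustration, not taken from the linked thread): scan the token list and greedily merge any run of tokens that matches one of the phrases.

def merge_subsequences(tokens, phrases):
    # split each phrase into its token sequence; try longer phrases first so
    # a longer match is not shadowed by a shorter prefix of it
    split_phrases = sorted((p.split() for p in phrases), key=len, reverse=True)
    result, i = [], 0
    while i < len(tokens):
        for parts in split_phrases:
            # merge a run of tokens matching this phrase into one token
            if tokens[i:i + len(parts)] == parts:
                result.append(' '.join(parts))
                i += len(parts)
                break
        else:
            # no phrase starts at this position, keep the token as-is
            result.append(tokens[i])
            i += 1
    return result

print(merge_subsequences(article_list, disease_list))
# ['Tricuspid valve regurgitation', 'and', 'lithium carbonate', 'toxicity',
#  'in', 'a', 'newborn', 'infant', '.']

This avoids placeholders entirely, at the cost of an inner loop over the phrases; with many phrases you would want to index them, e.g. in a dict keyed by their first token.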
CodePudding user response:
As I commented, just reverse-engineering that suggestion gives the solution: join the tokens into a single string, replace the spaces inside each disease name with '_', and then do whitespace tokenization. This won't work if the data contains '_' on its own.
article_list = ['Tricuspid', 'valve', 'regurgitation', 'and', 'lithium', 'carbonate',
                'toxicity', 'in', 'a', 'newborn', 'infant', '.']
disease_list = ['Tricuspid valve regurgitation', 'lithium carbonate', 'toxicity']

# join the tokens back into a single string
data = ' '.join(article_list)
# replace the spaces inside each disease name with '_' so it survives the split
for disease in disease_list:
    d = disease.replace(' ', '_')
    data = data.replace(disease, d)
# whitespace tokenization now keeps each underscored disease name as one token
tokens = data.split()
# restore the spaces inside the merged tokens
for i, t in enumerate(tokens):
    tokens[i] = t.replace('_', ' ')
print(tokens)
"""
['Tricuspid valve regurgitation', 'and', 'lithium carbonate', 'toxicity',
'in', 'a', 'newborn', 'infant', '.']
"""