Numbering a certain word in a text

Time:01-30

I want to add a reference number, in the format (number), after certain words in a text.

The code below produces some correct output. However, it fails when a keyword also appears inside a longer keyword that includes an adjective (for example "sample" inside "first sample"), or when the word carries a suffix in the text.

Those are the two edge cases I could think of: keywords that overlap with longer adjective-plus-noun keywords, and matching a dictionary word when it appears in the text with a suffix.

I tried this:

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

for keyword, number in words_to_number.items():
    pattern = r"\b" + keyword + r"\b"
    text = re.sub(pattern, keyword + " (" + str(number) + ")", text)

print(text)

I got this:

This is a first sample (3) (1) and this is a second sample (3) (2).

instead of this:

This is a first sample (1) and this is a second sample (2).

CodePudding user response:

The problem is that you put the keyword back into the text after matching it, so a later keyword that is a substring of an earlier one (here "sample" inside "first sample") still matches the text that has already been numbered.
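To make that concrete, here is a small trace of the question's loop that records the text after each keyword is processed. The `steps` list is only an illustrative addition, and the `+` concatenations are assumed to be what the question's code intended.

```python
import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

steps = []  # text after each keyword has been processed
for keyword, number in words_to_number.items():
    pattern = r"\b" + keyword + r"\b"
    text = re.sub(pattern, keyword + " (" + str(number) + ")", text)
    steps.append(text)

for step in steps:
    print(step)
# This is a first sample (1) and this is a second sample.
# This is a first sample (1) and this is a second sample (2).
# This is a first sample (3) (1) and this is a second sample (3) (2).
```

The first two passes are fine; the third pass finds "sample" again inside the keywords that were written back.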

Consider what happens when you don't put the matched keyword back:

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

for keyword, number in words_to_number.items():
    pattern = rf"\b{keyword}\b"
    text = re.sub(pattern, f"({number})", text)

print(text)  # This is a (1) and this is a (2).

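To fix the issue, you could use the number as a placeholder and put each keyword back in a second for loop. Note that this relies on the longer keywords coming before the shorter ones in the dictionary, and on the text not already containing something that looks like a placeholder, such as (1):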

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

for keyword, number in words_to_number.items():
    pattern = rf"\b{keyword}\b"
    text = re.sub(pattern, f"({number})", text)

print(text)  # This is a (1) and this is a (2).

for keyword, number in words_to_number.items():
    pattern = rf"\({number}\)"
    text = re.sub(pattern, f"{keyword} ({number})", text)

print(text)  # This is a first sample (1) and this is a second sample (2).

CodePudding user response:

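Alternatively, do it in a single pass: join the keywords into one regex with | and pass a callback function to re.sub as the replacement. Alternatives are tried from left to right, so longer keywords must come before the shorter ones they contain.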

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}


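# Escape each keyword so it is matched literally, then join with "|".
regex = "|".join(re.escape(k) for k in words_to_number)

# The callback looks up the number for whichever keyword matched.
text_new = re.sub(regex, lambda m: f"{m.group()} ({words_to_number[m.group()]})", text)

print(text_new)  # This is a first sample (1) and this is a second sample (2).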
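The single-pass idea can be made less dependent on dictionary order, and extended toward the suffix case from the question. The following is only a sketch: the function name `number_keywords` and the `allow_suffix` flag are hypothetical, and treating "suffix" as "any trailing word characters" is an assumption that will also match unrelated words such as "sampled".

```python
import re

def number_keywords(text, words_to_number, allow_suffix=False):
    # Sort longest first so "first sample" is tried before "sample",
    # regardless of the order the dictionary was written in.
    keywords = sorted(words_to_number, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keywords)
    # Assumption: a suffix is any run of word characters after the keyword.
    suffix = r"\w*" if allow_suffix else ""
    pattern = re.compile(rf"\b({alternation}){suffix}\b")
    # group(0) is the full matched text, group(1) the dictionary keyword.
    return pattern.sub(
        lambda m: f"{m.group(0)} ({words_to_number[m.group(1)]})", text
    )

words_to_number = {"sample": 3, "first sample": 1, "second sample": 2}

print(number_keywords("This is a first sample and this is a second sample.",
                      words_to_number))
# This is a first sample (1) and this is a second sample (2).

print(number_keywords("Two samples and a first sample.",
                      words_to_number, allow_suffix=True))
# Two samples (3) and a first sample (1).
```

Because each position in the text is consumed by at most one match, a numbered keyword is never rescanned, which is what avoids the double numbering.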

CodePudding user response:

Personally, I would drop regex altogether and use Fractalism's method with plain str.replace:

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

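# First pass: swap each keyword for its bare number as a placeholder.
for word, number in words_to_number.items():
    text = text.replace(word, str(number))

# Second pass: expand each placeholder back to "keyword (number)".
for word, number in words_to_number.items():
    text = text.replace(str(number), f"{word} ({number})")

print(text)  # This is a first sample (1) and this is a second sample (2).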

Regex seems overkill in this situation as you are only matching against predefined strings without additional patterns.
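One caveat worth checking before adopting the str.replace version: bare digits as placeholders collide with digits already in the text, and str.replace has no notion of word boundaries. A minimal sketch of both effects, where the wrapper name `number_with_replace` is hypothetical and the inputs are made up for illustration:

```python
def number_with_replace(text, words_to_number):
    # Same two passes as the str.replace answer, wrapped in a function.
    for word, number in words_to_number.items():
        text = text.replace(word, str(number))
    for word, number in words_to_number.items():
        text = text.replace(str(number), f"{word} ({number})")
    return text

words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

# A digit that was already in the text is expanded as if it were a placeholder.
print(number_with_replace("Chapter 1 has a first sample.", words_to_number))
# Chapter first sample (1) has a first sample (1).

# Without word boundaries, the number lands in the middle of "samples".
print(number_with_replace("Two samples here.", words_to_number))
# Two sample (3)s here.
```

If the input may contain digits or inflected forms, the regex-based answers with \b are the safer starting point.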
