Find and remove slightly different substring on string

Time:03-24

I want to check whether a substring is contained in a string and remove it without touching the rest of the string. The catch is that the pattern I search with does not exactly match what appears in the string. In particular, the string may contain Spanish accented vowels while the substring is uppercase and unaccented, for example:

myString = "I'm júst a tésting stríng"
substring = 'TESTING'

Perform something to obtain:

resultingString = "I'm júst a stríng"

So far I've read that the difflib library can compare two strings and weigh their similarity somehow, but I'm not sure how to apply it to my case (not to mention that I failed to install the library).

Thanks!

CodePudding user response:

This normalize() function might be a little overkill, and the code from @Harpe at https://stackoverflow.com/a/71591988/218663 may well work fine.

Here I am going to break the original string into "words" and then join all the non-matching words back into a string:

import unicodedata

def normalize(text):
    # Decompose accented letters (NFD), drop the non-ASCII combining marks, then lowercase.
    return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("ascii").lower()

myString = "I'm júst a tésting stríng"
substring = "TESTING"
newString = " ".join(word for word in myString.split(" ") if normalize(word) != normalize(substring))

print(newString)

giving you:

I'm júst a stríng
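
The comparison above only matches whole space-separated words, so a word with punctuation attached (for example "tésting,") would slip through. A minimal sketch of a variant that also ignores surrounding punctuation is below; remove_word is a hypothetical helper name, and the sketch assumes it is acceptable to drop the attached punctuation together with the word:

```python
import string
import unicodedata

def normalize(text):
    # Decompose accented letters, drop the non-ASCII marks, lowercase.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()

def remove_word(text, target):
    # Hypothetical helper: drops every space-separated word that equals
    # target once accents, case and surrounding punctuation are ignored.
    wanted = normalize(target)
    kept = [
        word for word in text.split(" ")
        if normalize(word).strip(string.punctuation) != wanted
    ]
    return " ".join(kept)

print(remove_word("Tésting, I'm júst a TÉSTING stríng", "testing"))
```

Only leading and trailing punctuation is stripped before comparing, so the apostrophe inside "I'm" is left alone.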

CodePudding user response:

You can use the standard-library module unicodedata to reduce accented letters to plain ASCII letters like so:

import unicodedata
output = unicodedata.normalize('NFD', "I'm júst a tésting stríng").encode('ascii', 'ignore').decode('ascii')
print(output)

which will give

I'm just a testing string

You can then compare this with your input

"TESTING".lower() in output.lower()

which should return True.
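
That check only tells you whether the substring is present. A rough sketch of actually removing it from the original, still accented, string is below. It normalizes one character at a time so that each ASCII character remembers its position in the original. The names strip_accents and remove_substring are hypothetical helpers, not part of unicodedata, and the sketch assumes that only the first match should be removed:

```python
import unicodedata

def strip_accents(text):
    # NFD-decompose, drop anything that is not ASCII, then lowercase.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()

def remove_substring(text, substring):
    # Build the accent-free text together with a map back to the
    # index of the original character each folded character came from.
    folded_chars = []
    origin = []
    for i, ch in enumerate(text):
        for folded in strip_accents(ch):
            folded_chars.append(folded)
            origin.append(i)
    folded_text = "".join(folded_chars)

    needle = strip_accents(substring)
    pos = folded_text.find(needle) if needle else -1
    if pos == -1:
        return text  # nothing to remove

    start = origin[pos]
    end = origin[pos + len(needle) - 1] + 1
    # If the input was already decomposed, also take the combining
    # marks that belong to the last matched letter.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    # When the match sits between two spaces, drop one of them so the
    # result does not contain a double space.
    if start > 0 and text[start - 1] == " " and end < len(text) and text[end] == " ":
        end += 1
    return text[:start] + text[end:]

print(remove_substring("I'm júst a tésting stríng", "TESTING"))
```

Everything outside the matched span is copied from the original string unchanged, so the other accents survive.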
