Python Regex, how to substitute multiple occurrences with a single pattern?

I'm trying to make a fuzzy autocomplete suggestion box that highlights the searched characters with HTML tags.

For example, if the user types 'ldi' and one of the suggestions is "Leonardo DiCaprio", then the desired outcome is "<b>L</b>eonar<b>d</b>o D<b>i</b>Caprio". The first occurrence of each character is highlighted, in order of appearance.

What I'm doing right now is:

import re

def prototype_finding_chars_in_string():
    test_string_list = ["Leonardo DiCaprio", "Brad Pitt","Claire Danes","Tobey Maguire"]
    comp_string = "ldi" #chars to highlight
    regex = ".*?" + ".*?".join([f"({x})" for x in comp_string]) + ".*?" #results in .*?(l).*?(d).*?(i).*?
    regex_compiled = re.compile(regex, re.IGNORECASE)
    for x in test_string_list:
        re_search_result = re.search(regex_compiled, x) # correctly filters the test list to include only entries that features the search chars in order
        if re_search_result:
            print(f"char combination {comp_string} are in {x} result group: {re_search_result.groups()}")

results in

char combination ldi are in Leonardo DiCaprio result group: ('L', 'D', 'i')

Now I want to replace each occurrence in the result groups with <b>[whatever is in the result]</b>, and I'm not sure how to do it.

What I'm currently doing is looping over the result and using the built-in str.replace method to replace the occurrences:

def replace_with_bold(result_groups, original_string):
    output_string: str = original_string
    for result in result_groups:
        output_string = output_string.replace(result,f"<b>{result}</b>",1)
    
    return output_string

This results in:

Highlighted string: <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio

But I think looping over the results like this, when I already have the match groups, is wasteful. Furthermore, it's not even correct, because it searches the string from the beginning on each iteration. So for the input 'ooo' this is the result:

char combination ooo are in Leonardo DiCaprio result group: ('o', 'o', 'o')
Highlighted string: Le<b><b><b>o</b></b></b>nardo DiCaprio

When it should be Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>

Is there a way to simplify this? Maybe regex here is overkill?

CodePudding user response：

This should work:

for result in result_groups:
    output_string = re.sub(fr'(.*?(?!<b>))({result})((?!</b>).*)',
         r'\1<b>\2</b>\3',
         output_string,
         flags=re.IGNORECASE)

On each iteration, the first occurrence of result is replaced by <b>result</b> (the ? makes .* lazy, which is what selects the first occurrence), provided it is not already enclosed in a tag (the (?!<b>) and (?!</b>) lookaheads handle that part). \1, \2 and \3 refer to the first, second and third groups, and the IGNORECASE flag makes the match case-insensitive.
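For comparison, here is a single-pass sketch that avoids re-scanning the string on every iteration. It is not the approach above: it reuses the capture-group positions from one re.search call (via Match.span) and slices the original string around them. The helper name highlight_in_order is hypothetical, and the use of re.escape is an added assumption so that characters such as '.' or '(' are treated literally.

```python
import re

def highlight_in_order(text, chars):
    """Hypothetical helper: wrap each matched character in <b>...</b>.

    Returns None when the characters do not occur in order in `text`.
    """
    # One capture group per searched character, separated by lazy gaps.
    pattern = ".*?".join(f"({re.escape(c)})" for c in chars)
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return None

    pieces = []
    last = 0
    for i in range(1, len(chars) + 1):
        start, end = match.span(i)           # position of the i-th capture
        pieces.append(text[last:start])      # untouched text before it
        pieces.append(f"<b>{text[start:end]}</b>")
        last = end
    pieces.append(text[last:])               # remainder of the string
    return "".join(pieces)

print(highlight_in_order("Leonardo DiCaprio", "ldi"))
# <b>L</b>eonar<b>d</b>o D<b>i</b>Caprio
print(highlight_in_order("Leonardo DiCaprio", "ooo"))
# Le<b>o</b>nard<b>o</b> DiCapri<b>o</b>
```

Because the tags are inserted at the positions the regex actually matched, repeated letters such as 'ooo' are never wrapped twice, and the inserted tags themselves are never searched.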

CodePudding user response：

A way using re.split:

import re

test_string_list = ["Leonardo DiCaprio", "Brad Pitt", "Claire Danes", "Tobey Maguire"]

def filter_and_highlight(strings, letters):
    
    # e.g. 'lir' -> (l)(.*?)(i)(.*?)(r)
    pat = re.compile('(' + ')(.*?)('.join(letters) + ')', re.I)
    
    results = []
    
    for s in strings:
        parts = pat.split(s, 1)
        
        if len(parts) == 1: continue
        
        res = ''
        for i, p in enumerate(parts):
            if i & 1:
                p = '<b>' + p + '</b>'
                
            res += p
            
        results.append(res)
        
    return results

filter_and_highlight(test_string_list, 'lir')

A peculiarity of re.split is that the text matched by capture groups is included in the resulting list. Also, even when the first capture matches at the very start of the string, an empty string is returned before it, so the searched letters always sit at odd indexes in the list of parts.
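To make the odd-index claim concrete, the following minimal sketch prints what re.split returns for the pattern built from 'lir'. The variable names are illustrative only; the pattern is written out by hand here instead of being generated from the letters.

```python
import re

# Pattern equivalent to the one built from 'lir' above.
pat = re.compile("(l)(.*?)(i)(.*?)(r)", re.IGNORECASE)

# First capture matches at index 0, so an empty string comes first.
parts_leading = pat.split("Leonardo DiCaprio", 1)
print(parts_leading)
# ['', 'L', 'eonardo D', 'i', 'Cap', 'r', 'io']

# Match starts mid-string; the gap between 'i' and 'r' is empty.
parts_inner = pat.split("Claire Danes", 1)
print(parts_inner)
# ['C', 'l', 'a', 'i', '', 'r', 'e Danes']

# No match: the list contains only the original string.
parts_none = pat.split("Brad Pitt", 1)
print(parts_none)
# ['Brad Pitt']
```

In both matching cases the searched letters are at indexes 1, 3 and 5, which is why the `i & 1` test is enough to decide what gets wrapped in tags.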