Problem

I have a list of around ~1000 honorifics, see below for a sample.

Given an input string of a name, for example "her majesty queen elizabeth windsor", the function should return "elizabeth windsor". If there is no honorific present at the start of the name (to simplify the problem), the function should simple return the name itself (e.g. elizabeth windsor -> elizabeth windsor).

I have pretty intense latency constraints, so need to optimise this code as much as possible.

Working solution

Here is my working solution, there are some additional constraints to reduce false positives (for example lance is both an honorific and a first name), see the unit tests:

def strip_honorific(source: str, honorifics: List[str]) -> str:
    source_tokens = source.split()
    if len(source_tokens) > 2:
        for honorific in honorifics:
            if source.startswith(f"{honorific} "):
                stripped_source = source[len(honorific)   1 :]
                if len(stripped_source.split()) > 1:
                    return stripped_source
    return source

Unit tests

def test_honorifics():
    assert strip_honorific(source="her majesty queen elizabeth windsor", honorifics = honorifics) == "elizabeth windsor"
    assert strip_honorific(source="elizabeth windsor", honorifics = honorifics) == "elizabeth windsor"
    assert strip_honorific(source="mrs elizabeth windsor", honorifics = honorifics) == "elizabeth windsor"
    assert strip_honorific(source="mrselizabeth windsor", honorifics = honorifics) == "mrselizabeth windsor"
    assert strip_honorific(source="mrselizabeth windsor", honorifics = honorifics) == "mrselizabeth windsor"
    assert strip_honorific(source="her majesty queen", honorifics = honorifics) == "her majesty queen"
    assert strip_honorific(source="her majesty queen elizabeth", honorifics = honorifics) == "her majesty queen elizabeth"
    assert strip_honorific(source="kapitan fred", honorifics = honorifics) == "kapitan fred"
    
    
test_honorifics()

Benchmark

For a basic benchmark, I've used the below list of honorifics (minus the ellipses).

source_lst = [
    "her majesty queen elizabeth windsor",
    "mr fred wilson",
    "the rt hon nolan borak",
    "his most eminent highness simon smithson", 
    "kapteinis jurijs jakovļevs", 
    "miss nancy garland",
    "missnancy garland",
]


times = []
for _ in range(1000):
    for source in source_lst:
        t0 = time.time()
        strip_honorific(source=source, honorifics = honorifics)
        times.append(time.time() - t0)

print(f"Mean time: {sum(times)/ len(times)}s") # Mean time: 5.11584963117327e-06s

Honorifics list

honorifics = [
    "mr", 
    "mrs", 
    "the hon",
    "the hon dr",
    "the hon lady",
    "the hon lord",
    "the hon mrs",
    "the hon sir",
    "the honourable",
    "the rt hon",
    "her majesty queen",
    "his majesty king",
    "vina",
    "flottiljamiral",
    "superintendent",
    "rabbi",
    "diraja",
    "domnul",
    "kindralleitnant",
    "countess",
    "pan",
    "khatib",
    "zur",
    "vice",
    "don",
    "flotiles",
    "dipl",
    "his most eminent highness", 
    ...
    "the reverend", 
    "archbishop", 
    "sheik", 
    "shaikh", 
    "the rt hon lord", 
    "la tres honorable"
    "ekselence", 
    "kapteinis", 
    "kapitan", 
    "excellenza"
    "mr", 
    "mrs", 
    "miss"
]

CodePudding user response：

First of all I had a doubt about how the following input should be dealt with:

"the hon lady dana"

When the honorific "the hon lady" is taken, then it doesn't match, because "dana" is just one word, but when the shorter honorific "the hon" is taken, then it does match, and "lady dana" would be the stripped version, but it would be odd to have "lady" retained, since it clearly is part of a longer honorific.

I would think that because of the first, longer match, no other attempt should be made, and nothing should be stripped from the input string. I took that approach in my attempts below.

I will offer two alternatives:

Using regex
Using word-based trie

Using your benchmark there wasn't much difference, but for realistic data the ratio between matches and non-matches might have some influence on total running time.

Regex solution

Preprocessing:

honorifics_re = re.compile(fr"^(?:{'|'.join(sorted(honorifics, key=len, reverse=True))}) (\S ( \S)?.*)")

Actual function:

def strip_honorific(source: str, honorifics) -> str:
    m = honorifics.match(source)
    return m[1] if m and m[2] else source

Call with honorifics = honorifics_re

Trie solution

Preprocessing:

def make_trie(honorifics):
    root = {}
    for honorific in honorifics:
        node = root
        for word in honorific.split():
            if word not in node:
                node[word] = {}
            node = node[word]
        node["$$"] = len(honorific)   1
    return root

Actual function:

def strip_honorific(source: str, honorifics) -> str:
    node = honorifics
    for word in source.split():
        if word not in node:
            if "$$" in node:
                rest = source[node["$$"]:]
                if " " in rest:
                    return rest
            return source
        node = node[word]
    return source

Call with honorifics = make_trie(honorifics)

CodePudding user response：

I was able to improve the performance by >2x by reformatting the honorifics list into an alternative form.

def reformat_honorifics(honorifics):
    honorifics_by_letter_dct = {}
    for honorific in honorifics:
        honorifics_by_letter_dct[honorific[0]] = honorifics_by_letter_dct.get(honorific[0], {})
        
        honorifics_by_letter_dct[honorific[0]][len(honorific.split()[0])] = honorifics_by_letter_dct[honorific[0]].get(len(honorific.split()[0]), [])   [honorific]

    return honorifics_by_letter_dct    
    
reformatted_honorifics = reformat_honorifics(honorifics)


def strip_honorific_1(source: str, honorifics) -> str:
    source_tokens = source.split()
    if len(source_tokens) > 2:
        if honorifics.get(source_tokens[0][0]):
            if honorifics[source_tokens[0][0]].get(len(source_tokens[0])):
                for honorific in honorifics[source_tokens[0][0]].get(len(source_tokens[0])):
                    if source.startswith(f"{honorific} "):
                        stripped_source = source[len(honorific)   1 :]
                        if len(stripped_source.split()) > 1:
                            return stripped_source
    return source

Mean time: 2.1538e-06s