Problem
I have a list of around ~1000 honorifics, see below for a sample.
Given an input string of a name, for example "her majesty queen elizabeth windsor"
, the function should return "elizabeth windsor"
. If there is no honorific present at the start of the name (to simplify the problem), the function should simple return the name itself (e.g. elizabeth windsor
-> elizabeth windsor
).
I have pretty intense latency constraints, so need to optimise this code as much as possible.
Working solution
Here is my working solution, there are some additional constraints to reduce false positives (for example lance
is both an honorific and a first name), see the unit tests:
def strip_honorific(source: str, honorifics: List[str]) -> str:
source_tokens = source.split()
if len(source_tokens) > 2:
for honorific in honorifics:
if source.startswith(f"{honorific} "):
stripped_source = source[len(honorific) 1 :]
if len(stripped_source.split()) > 1:
return stripped_source
return source
Unit tests
def test_honorifics():
assert strip_honorific(source="her majesty queen elizabeth windsor", honorifics = honorifics) == "elizabeth windsor"
assert strip_honorific(source="elizabeth windsor", honorifics = honorifics) == "elizabeth windsor"
assert strip_honorific(source="mrs elizabeth windsor", honorifics = honorifics) == "elizabeth windsor"
assert strip_honorific(source="mrselizabeth windsor", honorifics = honorifics) == "mrselizabeth windsor"
assert strip_honorific(source="mrselizabeth windsor", honorifics = honorifics) == "mrselizabeth windsor"
assert strip_honorific(source="her majesty queen", honorifics = honorifics) == "her majesty queen"
assert strip_honorific(source="her majesty queen elizabeth", honorifics = honorifics) == "her majesty queen elizabeth"
assert strip_honorific(source="kapitan fred", honorifics = honorifics) == "kapitan fred"
test_honorifics()
Benchmark
For a basic benchmark, I've used the below list of honorifics (minus the ellipses).
source_lst = [
"her majesty queen elizabeth windsor",
"mr fred wilson",
"the rt hon nolan borak",
"his most eminent highness simon smithson",
"kapteinis jurijs jakovļevs",
"miss nancy garland",
"missnancy garland",
]
times = []
for _ in range(1000):
for source in source_lst:
t0 = time.time()
strip_honorific(source=source, honorifics = honorifics)
times.append(time.time() - t0)
print(f"Mean time: {sum(times)/ len(times)}s") # Mean time: 5.11584963117327e-06s
Honorifics list
honorifics = [
"mr",
"mrs",
"the hon",
"the hon dr",
"the hon lady",
"the hon lord",
"the hon mrs",
"the hon sir",
"the honourable",
"the rt hon",
"her majesty queen",
"his majesty king",
"vina",
"flottiljamiral",
"superintendent",
"rabbi",
"diraja",
"domnul",
"kindralleitnant",
"countess",
"pan",
"khatib",
"zur",
"vice",
"don",
"flotiles",
"dipl",
"his most eminent highness",
...
"the reverend",
"archbishop",
"sheik",
"shaikh",
"the rt hon lord",
"la tres honorable"
"ekselence",
"kapteinis",
"kapitan",
"excellenza"
"mr",
"mrs",
"miss"
]
CodePudding user response:
First of all I had a doubt about how the following input should be dealt with:
"the hon lady dana"
When the honorific "the hon lady" is taken, then it doesn't match, because "dana" is just one word, but when the shorter honorific "the hon" is taken, then it does match, and "lady dana" would be the stripped version, but it would be odd to have "lady" retained, since it clearly is part of a longer honorific.
I would think that because of the first, longer match, no other attempt should be made, and nothing should be stripped from the input string. I took that approach in my attempts below.
I will offer two alternatives:
- Using regex
- Using word-based trie
Using your benchmark there wasn't much difference, but for realistic data the ratio between matches and non-matches might have some influence on total running time.
Regex solution
Preprocessing:
honorifics_re = re.compile(fr"^(?:{'|'.join(sorted(honorifics, key=len, reverse=True))}) (\S ( \S)?.*)")
Actual function:
def strip_honorific(source: str, honorifics) -> str:
m = honorifics.match(source)
return m[1] if m and m[2] else source
Call with honorifics = honorifics_re
Trie solution
Preprocessing:
def make_trie(honorifics):
root = {}
for honorific in honorifics:
node = root
for word in honorific.split():
if word not in node:
node[word] = {}
node = node[word]
node["$$"] = len(honorific) 1
return root
Actual function:
def strip_honorific(source: str, honorifics) -> str:
node = honorifics
for word in source.split():
if word not in node:
if "$$" in node:
rest = source[node["$$"]:]
if " " in rest:
return rest
return source
node = node[word]
return source
Call with honorifics = make_trie(honorifics)
CodePudding user response:
I was able to improve the performance by >2x by reformatting the honorifics list into an alternative form.
def reformat_honorifics(honorifics):
honorifics_by_letter_dct = {}
for honorific in honorifics:
honorifics_by_letter_dct[honorific[0]] = honorifics_by_letter_dct.get(honorific[0], {})
honorifics_by_letter_dct[honorific[0]][len(honorific.split()[0])] = honorifics_by_letter_dct[honorific[0]].get(len(honorific.split()[0]), []) [honorific]
return honorifics_by_letter_dct
reformatted_honorifics = reformat_honorifics(honorifics)
def strip_honorific_1(source: str, honorifics) -> str:
source_tokens = source.split()
if len(source_tokens) > 2:
if honorifics.get(source_tokens[0][0]):
if honorifics[source_tokens[0][0]].get(len(source_tokens[0])):
for honorific in honorifics[source_tokens[0][0]].get(len(source_tokens[0])):
if source.startswith(f"{honorific} "):
stripped_source = source[len(honorific) 1 :]
if len(stripped_source.split()) > 1:
return stripped_source
return source
Mean time: 2.1538e-06s