I'm using a named entity recognition (NER) model to find names in a text string. For hyphenated names like Jane Miller-Smith, the NER model returns the parts separately, like this:
names = ['Jane','Miller','-','Smith']
What's a simple way to join the items before and after the '-' into one string in this list, so that I end up with a list of first and last name like name = ['Jane', 'Miller-Smith']?
I've so far tried to loop through the list of names based on solutions like this for different hyphenated name versions:
name1 = ['Jane', 'Miller', '-','Smith']
name = ['Jane', '-', 'Marie','Miller', '-','Smith']
new_name = []
for cur, nxt in zip(name, name[1:]):
    print(cur, nxt)
    if cur == '-':
        hyph = cur + nxt
        new_name.append(hyph)
        print("hyph: ", hyph)
    else:
        new_name.append(cur)
        print("cur: ", cur)
print(new_name)
But I can't wrap my head around how to combine only the strings before and after the hyphen while keeping the other, non-hyphenated strings in their original order (so that the last name doesn't suddenly come first).
CodePudding user response:
Here the trick would be to join the list with a field delimiter you won't find in your list (e.g., |). Then you replace the pattern |-| with - and split back using your field delimiter.
names = ['Jane', '-', 'Marie','Miller', '-','Smith']
print('|'.join(names).replace('|-|', '-').split('|'))
Output:
['Jane-Marie', 'Miller-Smith']
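One caveat: the trick only holds if the delimiter never occurs inside a token. Here is a minimal sketch that wraps the one-liner in a function and checks that assumption first. The function name `merge_hyphenated` and the `delimiter` parameter are my own illustrative choices, not part of the answer above.

```python
def merge_hyphenated(tokens, delimiter='|'):
    """Merge 'A', '-', 'B' into 'A-B' using join/replace/split."""
    # The approach breaks if the delimiter appears inside any token,
    # so refuse to continue in that case (assumed policy for this sketch).
    if any(delimiter in tok for tok in tokens):
        raise ValueError(f'delimiter {delimiter!r} occurs in a token')
    hyphen_pattern = delimiter + '-' + delimiter
    joined = delimiter.join(tokens)
    return joined.replace(hyphen_pattern, '-').split(delimiter)


print(merge_hyphenated(['Jane', 'Miller', '-', 'Smith']))
# ['Jane', 'Miller-Smith']
```

If your tokens might contain |, pass a different delimiter that you know cannot appear in the NER output.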
CodePudding user response:
Scan from right to left, replacing the three-element slices whenever a hyphen is found:
>>> names = ['Jane', '-', 'Marie','Miller', '-','Smith']
>>> for i in reversed(range(len(names))):
...     if names[i] == '-':
...         names[i-1:i+2] = [f'{names[i-1]}-{names[i+1]}']
...
>>> names
['Jane-Marie', 'Miller-Smith']
An alternative is to loop left-to-right and build a new result list:
>>> names = ['Jane', '-', 'Marie', 'Miller', '-','Smith']
>>> result = []
>>> it = iter(names)
>>> for tok in it:
...     if tok == '-':
...         tok = result.pop() + '-' + next(it)
...     result.append(tok)
...
>>> result
['Jane-Marie', 'Miller-Smith']
>>> names
['Jane', '-', 'Marie', 'Miller', '-', 'Smith']
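If you want to reuse the left-to-right version, it could be wrapped in a function roughly like this. This is only a sketch: the name `merge_left_to_right` is made up, and leaving a leading or trailing '-' untouched is my assumption about what you'd want for malformed NER output.

```python
def merge_left_to_right(tokens):
    """Return a new list with 'A', '-', 'B' merged into 'A-B'."""
    result = []
    it = iter(tokens)
    for tok in it:
        # Only merge when there is something before the hyphen.
        if tok == '-' and result:
            nxt = next(it, None)  # None if the hyphen is the last token
            if nxt is not None:
                tok = result.pop() + '-' + nxt
        result.append(tok)
    return result


print(merge_left_to_right(['Jane', 'Miller', '-', 'Smith']))
# ['Jane', 'Miller-Smith']
```

Because it builds a new list, the input list is left unchanged.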
CodePudding user response:
Using an iterator and itertools:
from itertools import chain, pairwise
# for python <3.10, check the pairwise recipe:
# https://docs.python.org/3/library/itertools.html#itertools.pairwise
# or iterator = zip(names, names[1:] + [''])
names = ['Jane', '-', 'Marie', 'John', 'Miller', '-','Smith']
out = []
# pad with '' so the last name also appears as the first item of a pair
iterator = pairwise(chain(names, ['']))
for (a, b) in iterator:
    if b == '-':
        # consume the next two pairs: their first items are '-' and the second name
        out.append(a + next(iterator)[0] + next(iterator)[0])
    else:
        out.append(a)
print(out)
Compact version:
iterator = pairwise(chain(names, ['']))
out = [a + next(iterator)[0] + next(iterator)[0] if b == '-' else a
       for (a, b) in iterator]
Output: ['Jane-Marie', 'John', 'Miller-Smith']
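For Python versions before 3.10, where itertools.pairwise is not available, a tee-based stand-in can be defined by hand. The sketch below uses a commonly seen equivalent of that function; wrapping the logic in `merge_with_pairwise` is my own naming for illustration.

```python
from itertools import chain, tee


def pairwise(iterable):
    # Yield overlapping pairs: pairwise('ABCD') -> AB BC CD
    a, b = tee(iterable)
    next(b, None)  # advance the second copy by one item
    return zip(a, b)


def merge_with_pairwise(names):
    # Pad with '' so the last name still shows up as the first item of a pair.
    iterator = pairwise(chain(names, ['']))
    out = []
    for a, b in iterator:
        if b == '-':
            # Skip ahead two pairs; their first items are '-' and the next name.
            out.append(a + next(iterator)[0] + next(iterator)[0])
        else:
            out.append(a)
    return out


print(merge_with_pairwise(['Jane', '-', 'Marie', 'John', 'Miller', '-', 'Smith']))
# ['Jane-Marie', 'John', 'Miller-Smith']
```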
CodePudding user response:
You need to keep a stack and keep checking for the - symbol. When the top of the stack is a hyphen, join the previous word, the hyphen, and the current word into one string.
name = ['Jane', '-', 'Marie','Miller', '-','Smith']
result = []
for word in name:
    if result and result[-1] != '-':
        result.append(word)
    else:
        symbol = ''
        if result:
            symbol = result.pop()
        word2 = ''
        if result:
            word2 = result.pop()
        new_word = ''.join([word2, symbol, word])
        result.append(new_word)
print(result)
Output:
['Jane-Marie', 'Miller-Smith']
CodePudding user response:
def solution(the_list: list[str]) -> list[str]:
    # Note: this modifies the_list in place and also returns it.
    while '-' in the_list:
        hyphen_index = the_list.index('-')
        # Pop by index rather than remove() by value, so a duplicate
        # name elsewhere in the list is not removed by mistake.
        text_after_hyphen = the_list.pop(hyphen_index + 1)
        the_list.pop(hyphen_index)
        text_before_hyphen = the_list.pop(hyphen_index - 1)
        x = text_before_hyphen + '-' + text_after_hyphen
        the_list.insert(hyphen_index - 1, x)
    return the_list
print(solution(['Jane', 'Miller', '-', 'Smith']))
print(solution(['Jane', '-', 'Marie', 'Miller', '-', 'Smith']))
The output will be like this.
python3 main.py
['Jane', 'Miller-Smith']
['Jane-Marie', 'Miller-Smith']