I am using Python to parse a news article and extract the set of people's names it contains. Currently, every named entity classified as a person (`PER`) by Stanford's Stanza NLP library gets added to a set as follows:
maxnames = set()  # initialize an empty set for PER references
for entity in doc.entities:
    if entity.type == "PER":
        if entity.text not in maxnames:
            maxnames.add(entity.text)
Here is a real example I end up with:
{'von der Leyen', 'Meloni', 'Lars Danielsson', 'Filippo Mannino', 'Danielsson', 'Giorgia Meloni', 'Ursula von der Leyen', 'Matteo Piantedosi', 'Lamberto Giannini'}
What I'm trying to achieve is to keep only the most complete form of each name. In the above example the set should become:
{'Lars Danielsson', 'Filippo Mannino', 'Giorgia Meloni', 'Ursula von der Leyen', 'Matteo Piantedosi', 'Lamberto Giannini'}
because in the first set:
- 'von der Leyen' should be suppressed by 'Ursula von der Leyen'
- 'Meloni' suppressed by 'Giorgia Meloni' and so on.
This is what I've tried so far, but I'm getting lost :( Can you please spot the error?
def longestname(reference: str, nameset: set[str]) -> set[str]:
    """
    Return the longest name in a set of names
    """
    for name in nameset.copy():
        lenname = len(name)
        lenref = len(reference)
        if lenref < lenname:
            if reference in name:
                nameset.add(name)
        else:
            nameset.remove(name)
            nameset.add(reference)
    return nameset
nameset = set()
nameset = longestname("von der Leyen", nameset)
nameset = longestname("Meloni", nameset)
nameset = longestname("Lars Danielsson", nameset)
nameset = longestname("Lars", nameset)
nameset = longestname("Giorgia Meloni", nameset)
nameset = longestname("Ursula von der Leyen", nameset)
nameset = longestname("Giorgia", nameset)
print(nameset)
# should contain exactly:
# {'Lars Danielsson', 'Giorgia Meloni', 'Ursula von der Leyen'}
CodePudding user response:
This isn't the most efficient solution (it's O(N^2)), but if the number of names isn't huge I don't think striving for maximum efficiency is that important. The idea is to keep each name that is not a substring of some other name in the set:
>>> names = {'von der Leyen', 'Meloni', 'Lars Danielsson', 'Filippo Mannino', 'Danielsson', 'Giorgia Meloni', 'Ursula von der Leyen', 'Matteo Piantedosi', 'Lamberto Giannini'}
>>> {name for name in names if not any(
... name in other and name != other for other in names
... )}
{'Matteo Piantedosi', 'Lars Danielsson', 'Ursula von der Leyen', 'Filippo Mannino', 'Lamberto Giannini', 'Giorgia Meloni'}
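As for the error in the question's `longestname`: the only `nameset.add(reference)` call appears to sit inside the loop over the existing names, so with an empty set the loop body never runs and nothing is ever added. If you'd rather keep that incremental, one-name-at-a-time interface, here is a minimal sketch of how it could look. It uses the same substring test as the comprehension above, and it is one possible fix rather than the only one:

```python
def longestname(reference: str, nameset: set[str]) -> set[str]:
    """Add reference to nameset unless a more complete name is already there."""
    # A longer (or identical) name already covers the reference: nothing to do.
    if any(reference in name for name in nameset):
        return nameset
    # Otherwise drop every shorter name the reference covers, then add it.
    nameset -= {name for name in nameset if name in reference}
    nameset.add(reference)
    return nameset


nameset: set[str] = set()
for ref in ["von der Leyen", "Meloni", "Lars Danielsson", "Lars",
            "Giorgia Meloni", "Ursula von der Leyen", "Giorgia"]:
    nameset = longestname(ref, nameset)
print(nameset)
```

With the calls from the question this leaves 'Lars Danielsson', 'Giorgia Meloni' and 'Ursula von der Leyen' in the set (in arbitrary order, since sets are unordered).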
A more efficient solution might involve building a dictionary keyed on space-separated words, so you can narrow down the possible set of matches instead of doing an O(N) search each time. However, this gets a little tricky if you have overlaps (say you have "Jean-Claude Van Damme" and "Dick Van Dyke" both in the same article), so I leave figuring that out as an exercise for the reader.
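For what it's worth, here is a rough sketch of one way the word-keyed dictionary idea could go. The helper names `contains_words` and `most_complete` are made up for this illustration, and note one behavioural difference from the comprehension above: this version matches on whole words rather than raw substrings, so "Ann" would not be suppressed by "Joanna Smith". Treat it as a starting point under those assumptions, not a tuned implementation:

```python
from collections import defaultdict


def contains_words(longer: list[str], shorter: list[str]) -> bool:
    """True if shorter occurs in longer as a contiguous run of whole words."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def most_complete(names: set[str]) -> set[str]:
    """Keep only names that are not contained, word for word, in another name."""
    # Index every name under each of its space-separated words.
    index: dict[str, set[str]] = defaultdict(set)
    for name in names:
        for word in name.split():
            index[word].add(name)

    kept = set()
    for name in names:
        words = name.split()
        if not words:
            continue  # skip blank entries
        # Any name containing this one must also contain its first word,
        # so only that bucket needs to be searched.
        candidates = index[words[0]]
        if not any(
            other != name and contains_words(other.split(), words)
            for other in candidates
        ):
            kept.add(name)
    return kept


print(most_complete({"Jean-Claude Van Damme", "Dick Van Dyke", "Van Damme"}))
```

In the overlap case, "Van Damme" shares the "Van" bucket with both full names, but only "Jean-Claude Van Damme" actually contains it, so "Dick Van Dyke" is left alone. The worst case is still quadratic if every name shares a word, but for typical articles each bucket should be small.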