I am trying to build a list of the anchor names in w:hyperlink elements by looping over all of a document's elements with the python-docx library, using this code:
    def get_hyperlinks(docx__document):
        hyperlinks_in_document = list()
        for counter, element in enumerate(docx__document.elements):
            if isinstance(element, Paragraph) and not element.is_heading:
                hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
        return list(set(hyperlinks_in_document))
The above code returns a list of the anchors it finds. The issue I'm having is that when a piece of text is split across multiple runs, the list built up inside the loop can contain duplicate names, and the output looks like this:
['American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian']
I tried the approaches from here, but either the duplicates remained or the performance suffered. However, this code:
    def get_hyperlinks(docx__document):
        hyperlinks_in_document = list()
        returned_links = list()
        for counter, element in enumerate(docx__document.elements):
            if isinstance(element, Paragraph) and not element.is_heading:
                hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
        [returned_links.append(element_in_list) for element_in_list in hyperlinks_in_document
         if element_in_list not in returned_links]
        return returned_links
solves the duplicate problem, but the performance is affected. Any ideas that could help?
CodePudding user response:
I changed the previous code and found that switching the final list to a set gives me non-duplicate items in less time:
    def get_hyperlinks(docx__document):
        hyperlinks, returned_links = list(), set()
        for counter, element in enumerate(docx__document.elements):
            if isinstance(element, Paragraph) and not element.is_heading:
                hyperlinks = element._p.getparent().xpath('.//w:hyperlink')
                hyperlinks = [str(hyperlink.get(qn("w:anchor"))) for hyperlink in hyperlinks]
                returned_links = list(set().union(hyperlinks))
                # [returned_links.append(element_in_list) for element_in_list in hyperlinks
                #  if element_in_list not in returned_links]
        return returned_links
The commented-out lines show what I did before; the rest is the final code.
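One likely source of both the duplicates and the slowdown is the XPath itself: an expression that starts with // is evaluated from the document root, so every paragraph in the loop returns every hyperlink in the document. Starting the expression with .// keeps the search inside the current paragraph. Below is a minimal sketch of that idea. It assumes the function is given plain python-docx paragraphs (for example document.paragraphs); the name collect_anchor_names is made up for illustration and is not part of python-docx.

```python
def collect_anchor_names(paragraphs):
    """Return the unique w:anchor values in first-seen order.

    `paragraphs` is assumed to be an iterable of python-docx Paragraph
    objects (hypothetical helper, not a library function).
    """
    seen = {}  # dict keys keep insertion order and give O(1) membership tests
    for paragraph in paragraphs:
        # The leading "." restricts the search to this paragraph; a bare
        # "//" would search from the document root on every iteration.
        for anchor in paragraph._element.xpath('.//w:hyperlink/@w:anchor'):
            seen.setdefault(str(anchor), None)
    return list(seen)
```

With a document opened through python-docx, this would be called as collect_anchor_names(document.paragraphs). Filtering out headings, as in the code above, can be added inside the loop.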
CodePudding user response:
    return list(dict.fromkeys(your_duplicated_list))
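To illustrate why this works: dictionary keys are unique and, from Python 3.7 on, keep their insertion order, so dict.fromkeys removes duplicates in a single pass without the repeated "not in list" scans. A small sketch using a shortened version of the sample output from the question (the variable names are only for illustration):

```python
# Sample data shaped like the duplicated output in the question.
duplicated = ['American', 'Syrian', 'American', 'Syrian']

# dict.fromkeys keeps the first occurrence of each item, in order.
unique_in_order = list(dict.fromkeys(duplicated))
print(unique_in_order)  # ['American', 'Syrian']
```

Unlike list(set(...)), this keeps the anchors in the order they first appear in the document.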