Generate non-duplicate list in python

Time:03-07

I am trying to build a list of the anchor names found in w:hyperlink elements by looping over all of a document's elements with the python-docx library, using this code:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
    return list(set(hyperlinks_in_document))

The above code returns a list of the anchors it finds. The issue I'm having is that when a piece of text is split into multiple runs, the list built while looping over the elements can contain duplicate names, and the output looks like this:

['American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian', 'American', 'Syrian']

I tried several snippets from here, but they either still produced duplicates or hurt performance. This code, for example:

def get_hyperlinks(docx__document):
    hyperlinks_in_document = list()
    returned_links = list()
    for counter, element in enumerate(docx__document.elements):
        if isinstance(element, Paragraph) and not element.is_heading:
            hyperlinks_in_document.extend(element._element.xpath('//w:hyperlink/@w:anchor'))
            [returned_links.append(element_in_list) for element_in_list in hyperlinks_in_document
             if element_in_list not in returned_links]
    return returned_links

solves the duplicate issue, but performance suffers. Any ideas that could help?

CodePudding user response:

I changed the previous code and found that switching the final list to a set gives me non-duplicate items in less time:

from docx.oxml.ns import qn
from docx.text.paragraph import Paragraph

def get_hyperlinks(docx__document):
    hyperlinks = list()
    for element in docx__document.elements:
        if isinstance(element, Paragraph) and not element.is_heading:
            # './/' limits the search to this paragraph instead of the whole document
            hyperlinks.extend(element._p.xpath('.//w:hyperlink'))
    anchors = [hyperlink.get(qn("w:anchor")) for hyperlink in hyperlinks]
    # A set drops duplicates (order is not preserved); hyperlinks without w:anchor are skipped
    returned_links = list({anchor for anchor in anchors if anchor is not None})
    # [returned_links.append(element_in_list) for element_in_list in hyperlinks
    #          if element_in_list not in returned_links]
    return returned_links

Commented lines show what I did before and the whole answer is the final code.
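
A likely reason the question's version collects repeats is the XPath prefix rather than the runs themselves: an expression starting with '//' is evaluated from the document root even when it is called on a single paragraph, while './/' stays below the element it is called on. The sketch below illustrates the difference with plain lxml (the library python-docx is built on, assumed to be installed). The XML fragment and the variable names are made up for illustration and are not taken from a real document.

```python
from lxml import etree

# Hypothetical, minimal WordprocessingML fragment: two paragraphs, one hyperlink each.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}
XML = (
    f'<w:body xmlns:w="{W_NS}">'
    '<w:p><w:hyperlink w:anchor="American"/></w:p>'
    '<w:p><w:hyperlink w:anchor="Syrian"/></w:p>'
    '</w:body>'
)

body = etree.fromstring(XML)
first_paragraph = body[0]

# '//' starts at the document root, so a single paragraph reports every anchor.
absolute = first_paragraph.xpath("//w:hyperlink/@w:anchor", namespaces=NSMAP)

# './/' starts at the context node, so only this paragraph's anchors come back.
relative = first_paragraph.xpath(".//w:hyperlink/@w:anchor", namespaces=NSMAP)

print(absolute)  # ['American', 'Syrian']
print(relative)  # ['American']
```

With the '//' form, a loop over N paragraphs gathers the whole document's anchors N times, which matches the repeating pattern in the output shown in the question.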

CodePudding user response:

return list(dict.fromkeys(your_duplicated_list))
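
As a small illustration of this one-liner: dict keys are unique and, since Python 3.7, keep insertion order, so dict.fromkeys removes duplicates while preserving the first occurrence of each item. A set also removes duplicates but makes no ordering promise. Both rely on hashing, so each item is handled in roughly constant time, unlike an `if x not in some_list` check, which scans the list every time. The sample data and variable names below are made up.

```python
# Hypothetical sample input with repeats.
duplicated = ['Syrian', 'American', 'Syrian', 'American', 'Syrian']

# dict.fromkeys keeps the first occurrence of each item, in original order.
ordered_unique = list(dict.fromkeys(duplicated))
print(ordered_unique)  # ['Syrian', 'American']

# A set gives the same items, but the resulting order is not guaranteed.
unordered_unique = set(duplicated)
print(unordered_unique == set(ordered_unique))  # True
```

If the order of the anchors matters in the final list, dict.fromkeys is the safer choice of the two.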
