Why comparing the length of sets to remove duplicates in a doubly linked list doesn't work? (py-CodePudding

so recently I was learning python data structure, more specifically, the "doubly linked list". I came across a problem where I need to write code to remove duplicates in a doubly-linked list and I tried to use set() since it doesn't allow duplicates. I compared the length of the sets before adding the new node from the linked list and after adding that node to see if there were any changes with the length, if no, deleting that node since the length didn't change implying that the node is a duplicate. Unexpectedly, this method just doesn't work for some reason.

Here's my remove_duplicate function:

def remove_duplicates(self):
    cur = self.head
    seen = set()
    while cur:
      temp = cur
      prev_len = len(seen)
      seen.add(cur)
      after_len = len(seen)
      if prev_len == after_len:
        self.delete_node(cur)
        cur = temp.next
      else:
        cur = cur.next

Here's the delete_node function:

def delete_node(self, node):
    cur = self.head
    while cur:
      if cur == node and cur == self.head:
        # Case 1:
        if not cur.next:
          cur = None 
          self.head = None
          return

        # Case 2:
        else:
          nxt = cur.next
          cur.next = None 
          nxt.prev = None
          cur = None
          self.head = nxt
          return 

      elif cur == node:
        # Case 3:
        if cur.next:
          nxt = cur.next 
          prev = cur.prev
          prev.next = nxt
          nxt.prev = prev
          cur.next = None 
          cur.prev = None
          cur = None
          return

        # Case 4:
        else:
          prev = cur.prev 
          prev.next = None 
          cur.prev = None 
          cur = None 
          return 
      cur = cur.next

And the node class and linked list class are:

class Node:
  def __init__(self, data):
    self.data = data
    self.next = None
    self.prev = None


class DoublyLinkedList:
  def __init__(self):
    self.head = None

  def append(self, data):
    if self.head is None:
      new_node = Node(data)
      self.head = new_node
    else:
      new_node = Node(data)
      cur = self.head
      while cur.next:
          cur = cur.next
      cur.next = new_node
      new_node.prev = cur

  def prepend(self, data):
    if self.head is None:
      new_node = Node(data)
      self.head = new_node
    else:
      new_node = Node(data)
      self.head.prev = new_node
      new_node.next = self.head
      self.head = new_node

  def print_list(self):
    cur = self.head
    while cur:
      print(cur.data)
      cur = cur.next

I've been stuck on this issue for quite a while, so any advice and corrections on my remove_duplicate function will be very appreciated!

CodePudding user response：

There are two issues with your code. The first issue is that you're adding the Node instance, not the value within the node, to the seen set. Unless your linked list is circular, you're not going to run into the exact same node again, so you'll never see any duplicates this way.

Try adding the data attribute to the set instead of the node itself:

  prev_len = len(seen)
  seen.add(cur.data)      # add the data attribute, not the node itself to the set
  after_len = len(seen)

As was pointed out in the comments on your question, checking the length of the set is sort of a round-about way of membership testing. A simpler way would be to directly check cur.data in seen.

The other issue is more subtle, and has to do with how you delete your nodes and how you move to the next entry in the list, after finding and deleting your first duplicate.

Your code does:

    temp = cur

and then later:

    self.delete_node(cur)
    cur = temp.next

The problem is that self.delete_node modifies the node being deleted by setting its prev and next links to None. Since temp is just a reference to the node cur (not a copy), it sees the modified value, and so you can never delete more than one node from the list.

What would work better is saving the next node before the deletion, and then going directly to it:

temp = cur.next
self.delete_node(cur)
cur = temp

But really, you could avoid the issue by making delete_node not bother messing with the node being removed, since it's no longer linked to by any of the other nodes. The delete_node method could be radically simplified in a number of other ways into something like this:

def delete_node(self, node):   # no need to search for the node, we've already got it!
    if node.next:
        node.next.prev = node.prev

    if node.prev:
        node.prev.next = node.next
    else:
        self.head = node.next

With this deletion logic, you'd no longer need special case code for getting the next node in remove_duplicates, you would just unconditionally move on to cur.next.

def remove_duplicates(self):
    cur = self.head
    seen = set()

    while cur:
        if cur.data not in seen:
            seen.add(cur.data)
        else:
            self.delete_node(cur)

        cur = cur.next            # this now works even if we deleted cur